arkohut committed on
Commit 090e4a2 · verified · 1 Parent(s): de4f984

Upload 12 files

1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
README.md ADDED
@@ -0,0 +1,1254 @@
+ ---
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - mteb
+ - transformers
+ - transformers.js
+ inference: false
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ model-index:
+ - name: jina-embeddings-v2-base-zh
+   results:
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/AFQMC
+       name: MTEB AFQMC
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 48.51403119231363
+     - type: cos_sim_spearman
+       value: 50.5928547846445
+     - type: euclidean_pearson
+       value: 48.750436310559074
+     - type: euclidean_spearman
+       value: 50.50950238691385
+     - type: manhattan_pearson
+       value: 48.7866189440328
+     - type: manhattan_spearman
+       value: 50.58692402017165
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/ATEC
+       name: MTEB ATEC
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 50.25985700105725
+     - type: cos_sim_spearman
+       value: 51.28815934593989
+     - type: euclidean_pearson
+       value: 52.70329248799904
+     - type: euclidean_spearman
+       value: 50.94101139559258
+     - type: manhattan_pearson
+       value: 52.6647237400892
+     - type: manhattan_spearman
+       value: 50.922441325406176
+   - task:
+       type: Classification
+     dataset:
+       type: mteb/amazon_reviews_multi
+       name: MTEB AmazonReviewsClassification (zh)
+       config: zh
+       split: test
+       revision: 1399c76144fd37290681b995c656ef9b2e06e26d
+     metrics:
+     - type: accuracy
+       value: 34.944
+     - type: f1
+       value: 34.06478860660109
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/BQ
+       name: MTEB BQ
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 65.15667035488342
+     - type: cos_sim_spearman
+       value: 66.07110142081
+     - type: euclidean_pearson
+       value: 60.447598102249714
+     - type: euclidean_spearman
+       value: 61.826575796578766
+     - type: manhattan_pearson
+       value: 60.39364279354984
+     - type: manhattan_spearman
+       value: 61.78743491223281
+   - task:
+       type: Clustering
+     dataset:
+       type: C-MTEB/CLSClusteringP2P
+       name: MTEB CLSClusteringP2P
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: v_measure
+       value: 39.96714175391701
+   - task:
+       type: Clustering
+     dataset:
+       type: C-MTEB/CLSClusteringS2S
+       name: MTEB CLSClusteringS2S
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: v_measure
+       value: 38.39863566717934
+   - task:
+       type: Reranking
+     dataset:
+       type: C-MTEB/CMedQAv1-reranking
+       name: MTEB CMedQAv1
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: map
+       value: 83.63680381780644
+     - type: mrr
+       value: 86.16476190476192
+   - task:
+       type: Reranking
+     dataset:
+       type: C-MTEB/CMedQAv2-reranking
+       name: MTEB CMedQAv2
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: map
+       value: 83.74350667859487
+     - type: mrr
+       value: 86.10388888888889
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/CmedqaRetrieval
+       name: MTEB CmedqaRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 22.072
+     - type: map_at_10
+       value: 32.942
+     - type: map_at_100
+       value: 34.768
+     - type: map_at_1000
+       value: 34.902
+     - type: map_at_3
+       value: 29.357
+     - type: map_at_5
+       value: 31.236000000000004
+     - type: mrr_at_1
+       value: 34.259
+     - type: mrr_at_10
+       value: 41.957
+     - type: mrr_at_100
+       value: 42.982
+     - type: mrr_at_1000
+       value: 43.042
+     - type: mrr_at_3
+       value: 39.722
+     - type: mrr_at_5
+       value: 40.898
+     - type: ndcg_at_1
+       value: 34.259
+     - type: ndcg_at_10
+       value: 39.153
+     - type: ndcg_at_100
+       value: 46.493
+     - type: ndcg_at_1000
+       value: 49.01
+     - type: ndcg_at_3
+       value: 34.636
+     - type: ndcg_at_5
+       value: 36.278
+     - type: precision_at_1
+       value: 34.259
+     - type: precision_at_10
+       value: 8.815000000000001
+     - type: precision_at_100
+       value: 1.474
+     - type: precision_at_1000
+       value: 0.179
+     - type: precision_at_3
+       value: 19.73
+     - type: precision_at_5
+       value: 14.174000000000001
+     - type: recall_at_1
+       value: 22.072
+     - type: recall_at_10
+       value: 48.484
+     - type: recall_at_100
+       value: 79.035
+     - type: recall_at_1000
+       value: 96.15
+     - type: recall_at_3
+       value: 34.607
+     - type: recall_at_5
+       value: 40.064
+   - task:
+       type: PairClassification
+     dataset:
+       type: C-MTEB/CMNLI
+       name: MTEB Cmnli
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: cos_sim_accuracy
+       value: 76.7047504509922
+     - type: cos_sim_ap
+       value: 85.26649874800871
+     - type: cos_sim_f1
+       value: 78.13528724646915
+     - type: cos_sim_precision
+       value: 71.57587548638132
+     - type: cos_sim_recall
+       value: 86.01823708206688
+     - type: dot_accuracy
+       value: 70.13830426939266
+     - type: dot_ap
+       value: 77.01510412382171
+     - type: dot_f1
+       value: 73.56710042713817
+     - type: dot_precision
+       value: 63.955094991364426
+     - type: dot_recall
+       value: 86.57937806873977
+     - type: euclidean_accuracy
+       value: 75.53818400481059
+     - type: euclidean_ap
+       value: 84.34668448241264
+     - type: euclidean_f1
+       value: 77.51741608613047
+     - type: euclidean_precision
+       value: 70.65614777756399
+     - type: euclidean_recall
+       value: 85.85457096095394
+     - type: manhattan_accuracy
+       value: 75.49007817197835
+     - type: manhattan_ap
+       value: 84.40297506704299
+     - type: manhattan_f1
+       value: 77.63185324160932
+     - type: manhattan_precision
+       value: 70.03949595636637
+     - type: manhattan_recall
+       value: 87.07037643207856
+     - type: max_accuracy
+       value: 76.7047504509922
+     - type: max_ap
+       value: 85.26649874800871
+     - type: max_f1
+       value: 78.13528724646915
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/CovidRetrieval
+       name: MTEB CovidRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 69.178
+     - type: map_at_10
+       value: 77.523
+     - type: map_at_100
+       value: 77.793
+     - type: map_at_1000
+       value: 77.79899999999999
+     - type: map_at_3
+       value: 75.878
+     - type: map_at_5
+       value: 76.849
+     - type: mrr_at_1
+       value: 69.44200000000001
+     - type: mrr_at_10
+       value: 77.55
+     - type: mrr_at_100
+       value: 77.819
+     - type: mrr_at_1000
+       value: 77.826
+     - type: mrr_at_3
+       value: 75.957
+     - type: mrr_at_5
+       value: 76.916
+     - type: ndcg_at_1
+       value: 69.44200000000001
+     - type: ndcg_at_10
+       value: 81.217
+     - type: ndcg_at_100
+       value: 82.45
+     - type: ndcg_at_1000
+       value: 82.636
+     - type: ndcg_at_3
+       value: 77.931
+     - type: ndcg_at_5
+       value: 79.655
+     - type: precision_at_1
+       value: 69.44200000000001
+     - type: precision_at_10
+       value: 9.357
+     - type: precision_at_100
+       value: 0.993
+     - type: precision_at_1000
+       value: 0.101
+     - type: precision_at_3
+       value: 28.1
+     - type: precision_at_5
+       value: 17.724
+     - type: recall_at_1
+       value: 69.178
+     - type: recall_at_10
+       value: 92.624
+     - type: recall_at_100
+       value: 98.209
+     - type: recall_at_1000
+       value: 99.684
+     - type: recall_at_3
+       value: 83.772
+     - type: recall_at_5
+       value: 87.882
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/DuRetrieval
+       name: MTEB DuRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 25.163999999999998
+     - type: map_at_10
+       value: 76.386
+     - type: map_at_100
+       value: 79.339
+     - type: map_at_1000
+       value: 79.39500000000001
+     - type: map_at_3
+       value: 52.959
+     - type: map_at_5
+       value: 66.59
+     - type: mrr_at_1
+       value: 87.9
+     - type: mrr_at_10
+       value: 91.682
+     - type: mrr_at_100
+       value: 91.747
+     - type: mrr_at_1000
+       value: 91.751
+     - type: mrr_at_3
+       value: 91.267
+     - type: mrr_at_5
+       value: 91.527
+     - type: ndcg_at_1
+       value: 87.9
+     - type: ndcg_at_10
+       value: 84.569
+     - type: ndcg_at_100
+       value: 87.83800000000001
+     - type: ndcg_at_1000
+       value: 88.322
+     - type: ndcg_at_3
+       value: 83.473
+     - type: ndcg_at_5
+       value: 82.178
+     - type: precision_at_1
+       value: 87.9
+     - type: precision_at_10
+       value: 40.605000000000004
+     - type: precision_at_100
+       value: 4.752
+     - type: precision_at_1000
+       value: 0.488
+     - type: precision_at_3
+       value: 74.9
+     - type: precision_at_5
+       value: 62.96000000000001
+     - type: recall_at_1
+       value: 25.163999999999998
+     - type: recall_at_10
+       value: 85.97399999999999
+     - type: recall_at_100
+       value: 96.63000000000001
+     - type: recall_at_1000
+       value: 99.016
+     - type: recall_at_3
+       value: 55.611999999999995
+     - type: recall_at_5
+       value: 71.936
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/EcomRetrieval
+       name: MTEB EcomRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 48.6
+     - type: map_at_10
+       value: 58.831
+     - type: map_at_100
+       value: 59.427
+     - type: map_at_1000
+       value: 59.44199999999999
+     - type: map_at_3
+       value: 56.383
+     - type: map_at_5
+       value: 57.753
+     - type: mrr_at_1
+       value: 48.6
+     - type: mrr_at_10
+       value: 58.831
+     - type: mrr_at_100
+       value: 59.427
+     - type: mrr_at_1000
+       value: 59.44199999999999
+     - type: mrr_at_3
+       value: 56.383
+     - type: mrr_at_5
+       value: 57.753
+     - type: ndcg_at_1
+       value: 48.6
+     - type: ndcg_at_10
+       value: 63.951
+     - type: ndcg_at_100
+       value: 66.72200000000001
+     - type: ndcg_at_1000
+       value: 67.13900000000001
+     - type: ndcg_at_3
+       value: 58.882
+     - type: ndcg_at_5
+       value: 61.373
+     - type: precision_at_1
+       value: 48.6
+     - type: precision_at_10
+       value: 8.01
+     - type: precision_at_100
+       value: 0.928
+     - type: precision_at_1000
+       value: 0.096
+     - type: precision_at_3
+       value: 22.033
+     - type: precision_at_5
+       value: 14.44
+     - type: recall_at_1
+       value: 48.6
+     - type: recall_at_10
+       value: 80.10000000000001
+     - type: recall_at_100
+       value: 92.80000000000001
+     - type: recall_at_1000
+       value: 96.1
+     - type: recall_at_3
+       value: 66.10000000000001
+     - type: recall_at_5
+       value: 72.2
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/IFlyTek-classification
+       name: MTEB IFlyTek
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 47.36437091188918
+     - type: f1
+       value: 36.60946954228577
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/JDReview-classification
+       name: MTEB JDReview
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 79.5684803001876
+     - type: ap
+       value: 42.671935929201524
+     - type: f1
+       value: 73.31912729103752
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/LCQMC
+       name: MTEB LCQMC
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 68.62670112113864
+     - type: cos_sim_spearman
+       value: 75.74009123170768
+     - type: euclidean_pearson
+       value: 73.93002595958237
+     - type: euclidean_spearman
+       value: 75.35222935003587
+     - type: manhattan_pearson
+       value: 73.89870445158144
+     - type: manhattan_spearman
+       value: 75.31714936339398
+   - task:
+       type: Reranking
+     dataset:
+       type: C-MTEB/Mmarco-reranking
+       name: MTEB MMarcoReranking
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map
+       value: 31.5372713650176
+     - type: mrr
+       value: 30.163095238095238
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/MMarcoRetrieval
+       name: MTEB MMarcoRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 65.054
+     - type: map_at_10
+       value: 74.156
+     - type: map_at_100
+       value: 74.523
+     - type: map_at_1000
+       value: 74.535
+     - type: map_at_3
+       value: 72.269
+     - type: map_at_5
+       value: 73.41
+     - type: mrr_at_1
+       value: 67.24900000000001
+     - type: mrr_at_10
+       value: 74.78399999999999
+     - type: mrr_at_100
+       value: 75.107
+     - type: mrr_at_1000
+       value: 75.117
+     - type: mrr_at_3
+       value: 73.13499999999999
+     - type: mrr_at_5
+       value: 74.13499999999999
+     - type: ndcg_at_1
+       value: 67.24900000000001
+     - type: ndcg_at_10
+       value: 77.96300000000001
+     - type: ndcg_at_100
+       value: 79.584
+     - type: ndcg_at_1000
+       value: 79.884
+     - type: ndcg_at_3
+       value: 74.342
+     - type: ndcg_at_5
+       value: 76.278
+     - type: precision_at_1
+       value: 67.24900000000001
+     - type: precision_at_10
+       value: 9.466
+     - type: precision_at_100
+       value: 1.027
+     - type: precision_at_1000
+       value: 0.105
+     - type: precision_at_3
+       value: 27.955999999999996
+     - type: precision_at_5
+       value: 17.817
+     - type: recall_at_1
+       value: 65.054
+     - type: recall_at_10
+       value: 89.113
+     - type: recall_at_100
+       value: 96.369
+     - type: recall_at_1000
+       value: 98.714
+     - type: recall_at_3
+       value: 79.45400000000001
+     - type: recall_at_5
+       value: 84.06
+   - task:
+       type: Classification
+     dataset:
+       type: mteb/amazon_massive_intent
+       name: MTEB MassiveIntentClassification (zh-CN)
+       config: zh-CN
+       split: test
+       revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
+     metrics:
+     - type: accuracy
+       value: 68.1977135171486
+     - type: f1
+       value: 67.23114308718404
+   - task:
+       type: Classification
+     dataset:
+       type: mteb/amazon_massive_scenario
+       name: MTEB MassiveScenarioClassification (zh-CN)
+       config: zh-CN
+       split: test
+       revision: 7d571f92784cd94a019292a1f45445077d0ef634
+     metrics:
+     - type: accuracy
+       value: 71.92669804976462
+     - type: f1
+       value: 72.90628475628779
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/MedicalRetrieval
+       name: MTEB MedicalRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 49.2
+     - type: map_at_10
+       value: 54.539
+     - type: map_at_100
+       value: 55.135
+     - type: map_at_1000
+       value: 55.19199999999999
+     - type: map_at_3
+       value: 53.383
+     - type: map_at_5
+       value: 54.142999999999994
+     - type: mrr_at_1
+       value: 49.2
+     - type: mrr_at_10
+       value: 54.539
+     - type: mrr_at_100
+       value: 55.135999999999996
+     - type: mrr_at_1000
+       value: 55.19199999999999
+     - type: mrr_at_3
+       value: 53.383
+     - type: mrr_at_5
+       value: 54.142999999999994
+     - type: ndcg_at_1
+       value: 49.2
+     - type: ndcg_at_10
+       value: 57.123000000000005
+     - type: ndcg_at_100
+       value: 60.21300000000001
+     - type: ndcg_at_1000
+       value: 61.915
+     - type: ndcg_at_3
+       value: 54.772
+     - type: ndcg_at_5
+       value: 56.157999999999994
+     - type: precision_at_1
+       value: 49.2
+     - type: precision_at_10
+       value: 6.52
+     - type: precision_at_100
+       value: 0.8009999999999999
+     - type: precision_at_1000
+       value: 0.094
+     - type: precision_at_3
+       value: 19.6
+     - type: precision_at_5
+       value: 12.44
+     - type: recall_at_1
+       value: 49.2
+     - type: recall_at_10
+       value: 65.2
+     - type: recall_at_100
+       value: 80.10000000000001
+     - type: recall_at_1000
+       value: 93.89999999999999
+     - type: recall_at_3
+       value: 58.8
+     - type: recall_at_5
+       value: 62.2
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/MultilingualSentiment-classification
+       name: MTEB MultilingualSentiment
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 63.29333333333334
+     - type: f1
+       value: 63.03293854259612
+   - task:
+       type: PairClassification
+     dataset:
+       type: C-MTEB/OCNLI
+       name: MTEB Ocnli
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: cos_sim_accuracy
+       value: 75.69030860855442
+     - type: cos_sim_ap
+       value: 80.6157833772759
+     - type: cos_sim_f1
+       value: 77.87524366471735
+     - type: cos_sim_precision
+       value: 72.3076923076923
+     - type: cos_sim_recall
+       value: 84.37170010559663
+     - type: dot_accuracy
+       value: 67.78559826746074
+     - type: dot_ap
+       value: 72.00871467527499
+     - type: dot_f1
+       value: 72.58722247394654
+     - type: dot_precision
+       value: 63.57142857142857
+     - type: dot_recall
+       value: 84.58289334741288
+     - type: euclidean_accuracy
+       value: 75.20303194369248
+     - type: euclidean_ap
+       value: 80.98587256415605
+     - type: euclidean_f1
+       value: 77.26396917148362
+     - type: euclidean_precision
+       value: 71.03631532329496
+     - type: euclidean_recall
+       value: 84.68848996832101
+     - type: manhattan_accuracy
+       value: 75.20303194369248
+     - type: manhattan_ap
+       value: 80.93460699513219
+     - type: manhattan_f1
+       value: 77.124773960217
+     - type: manhattan_precision
+       value: 67.43083003952569
+     - type: manhattan_recall
+       value: 90.07391763463569
+     - type: max_accuracy
+       value: 75.69030860855442
+     - type: max_ap
+       value: 80.98587256415605
+     - type: max_f1
+       value: 77.87524366471735
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/OnlineShopping-classification
+       name: MTEB OnlineShopping
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 87.00000000000001
+     - type: ap
+       value: 83.24372135949511
+     - type: f1
+       value: 86.95554191530607
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/PAWSX
+       name: MTEB PAWSX
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 37.57616811591219
+     - type: cos_sim_spearman
+       value: 41.490259084930045
+     - type: euclidean_pearson
+       value: 38.9155043692188
+     - type: euclidean_spearman
+       value: 39.16056534305623
+     - type: manhattan_pearson
+       value: 38.76569892264335
+     - type: manhattan_spearman
+       value: 38.99891685590743
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/QBQTC
+       name: MTEB QBQTC
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 35.44858610359665
+     - type: cos_sim_spearman
+       value: 38.11128146262466
+     - type: euclidean_pearson
+       value: 31.928644189822457
+     - type: euclidean_spearman
+       value: 34.384936631696554
+     - type: manhattan_pearson
+       value: 31.90586687414376
+     - type: manhattan_spearman
+       value: 34.35770153777186
+   - task:
+       type: STS
+     dataset:
+       type: mteb/sts22-crosslingual-sts
+       name: MTEB STS22 (zh)
+       config: zh
+       split: test
+       revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
+     metrics:
+     - type: cos_sim_pearson
+       value: 66.54931957553592
+     - type: cos_sim_spearman
+       value: 69.25068863016632
+     - type: euclidean_pearson
+       value: 50.26525596106869
+     - type: euclidean_spearman
+       value: 63.83352741910006
+     - type: manhattan_pearson
+       value: 49.98798282198196
+     - type: manhattan_spearman
+       value: 63.87649521907841
+   - task:
+       type: STS
+     dataset:
+       type: C-MTEB/STSB
+       name: MTEB STSB
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: cos_sim_pearson
+       value: 82.52782476625825
+     - type: cos_sim_spearman
+       value: 82.55618986168398
+     - type: euclidean_pearson
+       value: 78.48190631687673
+     - type: euclidean_spearman
+       value: 78.39479731354655
+     - type: manhattan_pearson
+       value: 78.51176592165885
+     - type: manhattan_spearman
+       value: 78.42363787303265
+   - task:
+       type: Reranking
+     dataset:
+       type: C-MTEB/T2Reranking
+       name: MTEB T2Reranking
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map
+       value: 67.36693873615643
+     - type: mrr
+       value: 77.83847701797939
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/T2Retrieval
+       name: MTEB T2Retrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 25.795
+     - type: map_at_10
+       value: 72.258
+     - type: map_at_100
+       value: 76.049
+     - type: map_at_1000
+       value: 76.134
+     - type: map_at_3
+       value: 50.697
+     - type: map_at_5
+       value: 62.324999999999996
+     - type: mrr_at_1
+       value: 86.634
+     - type: mrr_at_10
+       value: 89.792
+     - type: mrr_at_100
+       value: 89.91900000000001
+     - type: mrr_at_1000
+       value: 89.923
+     - type: mrr_at_3
+       value: 89.224
+     - type: mrr_at_5
+       value: 89.608
+     - type: ndcg_at_1
+       value: 86.634
+     - type: ndcg_at_10
+       value: 80.589
+     - type: ndcg_at_100
+       value: 84.812
+     - type: ndcg_at_1000
+       value: 85.662
+     - type: ndcg_at_3
+       value: 82.169
+     - type: ndcg_at_5
+       value: 80.619
+     - type: precision_at_1
+       value: 86.634
+     - type: precision_at_10
+       value: 40.389
+     - type: precision_at_100
+       value: 4.93
+     - type: precision_at_1000
+       value: 0.513
+     - type: precision_at_3
+       value: 72.104
+     - type: precision_at_5
+       value: 60.425
+     - type: recall_at_1
+       value: 25.795
+     - type: recall_at_10
+       value: 79.565
+     - type: recall_at_100
+       value: 93.24799999999999
+     - type: recall_at_1000
+       value: 97.595
+     - type: recall_at_3
+       value: 52.583999999999996
+     - type: recall_at_5
+       value: 66.175
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/TNews-classification
+       name: MTEB TNews
+       config: default
+       split: validation
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 47.648999999999994
+     - type: f1
+       value: 46.28925837008413
+   - task:
+       type: Clustering
+     dataset:
+       type: C-MTEB/ThuNewsClusteringP2P
+       name: MTEB ThuNewsClusteringP2P
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: v_measure
+       value: 54.07641891287953
+   - task:
+       type: Clustering
+     dataset:
+       type: C-MTEB/ThuNewsClusteringS2S
+       name: MTEB ThuNewsClusteringS2S
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: v_measure
+       value: 53.423702062353954
+   - task:
+       type: Retrieval
+     dataset:
+       type: C-MTEB/VideoRetrieval
+       name: MTEB VideoRetrieval
+       config: default
+       split: dev
+       revision: None
+     metrics:
+     - type: map_at_1
+       value: 55.7
+     - type: map_at_10
+       value: 65.923
+     - type: map_at_100
+       value: 66.42
+     - type: map_at_1000
+       value: 66.431
+     - type: map_at_3
+       value: 63.9
+     - type: map_at_5
+       value: 65.225
+     - type: mrr_at_1
+       value: 55.60000000000001
+     - type: mrr_at_10
+       value: 65.873
+     - type: mrr_at_100
+       value: 66.36999999999999
+     - type: mrr_at_1000
+       value: 66.381
+     - type: mrr_at_3
+       value: 63.849999999999994
+     - type: mrr_at_5
+       value: 65.17500000000001
+     - type: ndcg_at_1
+       value: 55.7
+     - type: ndcg_at_10
+       value: 70.621
+     - type: ndcg_at_100
+       value: 72.944
+     - type: ndcg_at_1000
+       value: 73.25399999999999
+     - type: ndcg_at_3
+       value: 66.547
+     - type: ndcg_at_5
+       value: 68.93599999999999
+     - type: precision_at_1
+       value: 55.7
+     - type: precision_at_10
+       value: 8.52
+     - type: precision_at_100
+       value: 0.958
+     - type: precision_at_1000
+       value: 0.098
+     - type: precision_at_3
+       value: 24.733
+     - type: precision_at_5
+       value: 16
+     - type: recall_at_1
+       value: 55.7
+     - type: recall_at_10
+       value: 85.2
+     - type: recall_at_100
+       value: 95.8
+     - type: recall_at_1000
+       value: 98.3
+     - type: recall_at_3
+       value: 74.2
+     - type: recall_at_5
+       value: 80
+   - task:
+       type: Classification
+     dataset:
+       type: C-MTEB/waimai-classification
+       name: MTEB Waimai
+       config: default
+       split: test
+       revision: None
+     metrics:
+     - type: accuracy
+       value: 84.54
+     - type: ap
+       value: 66.13603199670062
+     - type: f1
+       value: 82.61420654584116
+ ---
+ <br><br>
+
+ <p align="center">
+ <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
+ </p>
+
+ <p align="center">
+ <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
+ </p>
+
+ ## Quick Start
+
+ The easiest way to start using `jina-embeddings-v2-base-zh` is to use Jina AI's [Embedding API](https://jina.ai/embeddings/).
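+
+ As a quick illustration, here is a minimal sketch of calling the hosted API over HTTP. The endpoint shape, response format, and the `JINA_API_KEY` environment variable are assumptions made for this example; consult the Embedding API documentation for the authoritative interface:
+
+ ```python
+ import os
+ import requests
+
+ # Hypothetical call: POST texts to the hosted embedding endpoint (assumed shape).
+ resp = requests.post(
+     "https://api.jina.ai/v1/embeddings",  # assumed endpoint; see the API docs
+     headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
+     json={
+         "model": "jina-embeddings-v2-base-zh",
+         "input": ["How is the weather today?", "今天天气怎么样?"],
+     },
+ )
+ resp.raise_for_status()
+ embeddings = [item["embedding"] for item in resp.json()["data"]]
+ ```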
+
+ ## Intended Usage & Model Info
+
+ `jina-embeddings-v2-base-zh` is a Chinese/English bilingual text **embedding model** supporting an **8192-token** sequence length.
+ It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths.
+ We have designed it for high performance in monolingual and cross-lingual applications and trained it specifically to support mixed Chinese-English input without bias.
+
+ JinaBERT improves on the original BERT architecture and is the first to apply [ALiBi](https://arxiv.org/abs/2108.12409) to an encoder architecture in order to support longer sequences. Unlike earlier monolingual and multilingual embedding models, this bilingual model is designed to better support both monolingual (Chinese-to-Chinese) and cross-lingual (Chinese-to-English) document retrieval.
+
+ Additionally, we provide the following embedding models:
+
+ - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
+ - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
+ - [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual embeddings **(you are here)**.
+ - [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual embeddings.
+ - [`jina-embeddings-v2-base-es`](): Spanish-English bilingual embeddings (soon).
+ - [`jina-embeddings-v2-base-code`](https://huggingface.co/jinaai/jina-embeddings-v2-base-code): 161 million parameters, code embeddings.
+
+ ## Data & Parameters
+
+ The data and training details are described in this [technical report](https://arxiv.org/abs/2402.17016).
+
+ ## Usage
+
+ **<details><summary>Please apply mean pooling when integrating the model.</summary>**
+ <p>
+
+ ### Why mean pooling?
+
+ Mean pooling takes all token embeddings from the model output and averages them at the sentence/paragraph level.
+ It has proven to be a highly effective way to produce high-quality sentence embeddings.
+ We offer an `encode` function that handles this for you.
+
+ However, if you would like to do it without using the default `encode` function:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+
+ def mean_pooling(model_output, attention_mask):
+     # Average the token embeddings, masking out padding positions.
+     token_embeddings = model_output[0]
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+ sentences = ['How is the weather today?', '今天天气怎么样?']
+
+ tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True, torch_dtype=torch.bfloat16)
+
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ ```
+
+ </p>
+ </details>
+
+ You can use Jina embedding models directly from the `transformers` package.
+
+ ```python
+ !pip install transformers
+ import torch
+ from transformers import AutoModel
+ from numpy.linalg import norm
+
+ # Cosine similarity between two embedding vectors.
+ cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True, torch_dtype=torch.bfloat16)
+ embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
+
+ If you only want to handle shorter sequences, such as 2k tokens, pass the `max_length` parameter to the `encode` function:
+
+ ```python
+ embeddings = model.encode(
+     ['Very long ... document'],
+     max_length=2048
+ )
+ ```
+
+ If you want to use the model together with the [sentence-transformers package](https://github.com/UKPLab/sentence-transformers/), make sure that you have installed the latest release and set `trust_remote_code=True` as well:
+
+ ```python
+ !pip install -U sentence-transformers
+ from sentence_transformers import SentenceTransformer
+ from numpy.linalg import norm
+
+ # Cosine similarity between two embedding vectors.
+ cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
+ model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
+ embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
+
+ As of its latest release (v2.3.0), sentence-transformers also supports Jina embeddings natively (please make sure that you are logged into Hugging Face as well):
+
+ ```python
+ !pip install -U sentence-transformers
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+
+ model = SentenceTransformer(
+     "jinaai/jina-embeddings-v2-base-zh",  # switch to en/zh for English or Chinese
+     trust_remote_code=True
+ )
+
+ # control your input sequence length up to 8192
+ model.max_seq_length = 1024
+
+ embeddings = model.encode([
+     'How is the weather today?',
+     '今天天气怎么样?'
+ ])
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
+
+ ## Alternatives to Using Transformers Package
+
+ 1. _Managed SaaS_: Get started with a free key on Jina AI's [Embedding API](https://jina.ai/embeddings/).
+ 2. _Private and high-performance deployment_: Get started by picking from our suite of models and deploying them on [AWS Sagemaker](https://aws.amazon.com/marketplace/seller-profile?id=seller-stch2ludm6vgy).
+
+ ## Use Jina Embeddings for RAG
+
+ According to the latest blog post from [LlamaIndex](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83):
+
+ > In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.
+
+ <img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*ZP2RVejCZovF3FDCg-Bx3A.png" width="780px">
+
+ ## Troubleshooting
+
+ **Loading of Model Code failed**
+
+ If you forgot to pass the `trust_remote_code=True` flag when calling `AutoModel.from_pretrained` or initializing the model via the `SentenceTransformer` class, you will receive an error that the model weights could not be initialized.
+ This is caused by transformers falling back to creating a default BERT model instead of a Jina embedding model:
+
+ ```bash
+ Some weights of the model checkpoint at jinaai/jina-embeddings-v2-base-zh were not used when initializing BertModel: ['encoder.layer.2.mlp.layernorm.weight', 'encoder.layer.3.mlp.layernorm.weight', 'encoder.layer.10.mlp.wo.bias', 'encoder.layer.5.mlp.wo.bias', 'encoder.layer.2.mlp.layernorm.bias', 'encoder.layer.1.mlp.gated_layers.weight', 'encoder.layer.5.mlp.gated_layers.weight', 'encoder.layer.8.mlp.layernorm.bias', ...
+ ```
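+
+ The fix is simply to pass the flag, exactly as in the usage examples above:
+
+ ```python
+ from transformers import AutoModel
+
+ # trust_remote_code=True lets transformers load the custom JinaBertModel code
+ # referenced in this repo's config.json instead of falling back to plain BertModel.
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
+ ```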
+
+ **User is not logged into Hugging Face**
+
+ The model is only available under [gated access](https://huggingface.co/docs/hub/models-gated).
+ This means you need to be logged into Hugging Face to load it.
+ If you receive the following error, you need to provide an access token, either by using the huggingface-cli or by passing the token via an environment variable:
+
+ ```bash
+ OSError: jinaai/jina-embeddings-v2-base-zh is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
+ If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
+ ```
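+
+ A minimal sketch of authenticating from Python (the `hf_...` value is a placeholder for your own access token; running `huggingface-cli login` in a shell is equivalent):
+
+ ```python
+ from huggingface_hub import login
+
+ # Store the access token so gated repositories can be downloaded in this environment.
+ login(token="hf_...")  # placeholder; use your own token
+ ```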
+
+ ## Contact
+
+ Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
+
+ ## Citation
+
+ If you find Jina Embeddings useful in your research, please cite the following paper:
+
+ ```
+ @article{mohr2024multi,
+   title={Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings},
+   author={Mohr, Isabelle and Krimmel, Markus and Sturua, Saba and Akram, Mohammad Kalim and Koukounas, Andreas and G{\"u}nther, Michael and Mastrapas, Georgios and Ravishankar, Vinit and Mart{\'\i}nez, Joan Fontanals and Wang, Feng and others},
+   journal={arXiv preprint arXiv:2402.17016},
+   year={2024}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,38 @@
+ {
+   "_name_or_path": "jinaai/jina-bert-implementation",
+   "architectures": [
+     "JinaBertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "attn_implementation": "torch",
+   "auto_map": {
+     "AutoConfig": "jinaai/jina-bert-implementation--configuration_bert.JinaBertConfig",
+     "AutoModel": "jinaai/jina-bert-implementation--modeling_bert.JinaBertModel",
+     "AutoModelForMaskedLM": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForMaskedLM",
+     "AutoModelForQuestionAnswering": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForQuestionAnswering",
+     "AutoModelForSequenceClassification": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForSequenceClassification",
+     "AutoModelForTokenClassification": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForTokenClassification"
+   },
+   "classifier_dropout": null,
+   "emb_pooler": "mean",
+   "feed_forward_type": "geglu",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 8192,
+   "model_max_length": 8192,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "alibi",
+   "torch_dtype": "float16",
+   "transformers_version": "4.30.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 61056
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.2.2",
+     "transformers": "4.30.2",
+     "pytorch": "2.0.1"
+   }
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:29b7cdda0c8fa9b18f8e0fbc4aba0c9537555fb16b139fa44be92c1e1b3253a8
+ size 321648328
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "max_seq_length": 8192,
+   "do_lower_case": false,
+   "model_args": {"trust_remote_code": true}
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff