|
--- |
|
tags: |
|
- mteb |
|
- sentence-transformers |
|
model-index: |
|
- name: piccolo-large-zh-v2 |
|
results: |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/AFQMC |
|
name: MTEB AFQMC |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 56.76055988260572 |
|
- type: cos_sim_spearman |
|
value: 61.49271876861677 |
|
- type: euclidean_pearson |
|
value: 59.14524585320711 |
|
- type: euclidean_spearman |
|
value: 60.63579339225774 |
|
- type: manhattan_pearson |
|
value: 59.14662752965445 |
|
- type: manhattan_spearman |
|
value: 60.635190265737904 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/ATEC |
|
name: MTEB ATEC |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 56.21706298831197 |
|
- type: cos_sim_spearman |
|
value: 59.19831457688953 |
|
- type: euclidean_pearson |
|
value: 62.37752017633299 |
|
- type: euclidean_spearman |
|
value: 58.79400967473204 |
|
- type: manhattan_pearson |
|
value: 62.37015943212308 |
|
- type: manhattan_spearman |
|
value: 58.79232537600814 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: mteb/amazon_reviews_multi |
|
name: MTEB AmazonReviewsClassification (zh) |
|
config: zh |
|
split: test |
|
revision: 1399c76144fd37290681b995c656ef9b2e06e26d |
|
metrics: |
|
- type: accuracy |
|
value: 49.440000000000005 |
|
- type: f1 |
|
value: 46.67381446305019 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/BQ |
|
name: MTEB BQ |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 70.99026329599994 |
|
- type: cos_sim_spearman |
|
value: 72.87565357908989 |
|
- type: euclidean_pearson |
|
value: 71.17690439270028 |
|
- type: euclidean_spearman |
|
value: 72.50428109969029 |
|
- type: manhattan_pearson |
|
value: 71.17262321033088 |
|
- type: manhattan_spearman |
|
value: 72.49845447987437 |
|
- task: |
|
type: Clustering |
|
dataset: |
|
type: C-MTEB/CLSClusteringP2P |
|
name: MTEB CLSClusteringP2P |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: v_measure |
|
value: 57.92713421071616 |
|
- task: |
|
type: Clustering |
|
dataset: |
|
type: C-MTEB/CLSClusteringS2S |
|
name: MTEB CLSClusteringS2S |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: v_measure |
|
value: 48.096546680932235 |
|
- task: |
|
type: Reranking |
|
dataset: |
|
type: C-MTEB/CMedQAv1-reranking |
|
name: MTEB CMedQAv1 |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: map |
|
value: 89.31003741715936 |
|
- type: mrr |
|
value: 91.38075396825397 |
|
- task: |
|
type: Reranking |
|
dataset: |
|
type: C-MTEB/CMedQAv2-reranking |
|
name: MTEB CMedQAv2 |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: map |
|
value: 90.13769781784876 |
|
- type: mrr |
|
value: 92.14329365079365 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/CmedqaRetrieval |
|
name: MTEB CmedqaRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 26.931 |
|
- type: map_at_10 |
|
value: 40.647 |
|
- type: map_at_100 |
|
value: 42.519 |
|
- type: map_at_1000 |
|
value: 42.616 |
|
- type: map_at_3 |
|
value: 36.144999999999996 |
|
- type: map_at_5 |
|
value: 38.717 |
|
- type: mrr_at_1 |
|
value: 40.935 |
|
- type: mrr_at_10 |
|
value: 49.684 |
|
- type: mrr_at_100 |
|
value: 50.598 |
|
- type: mrr_at_1000 |
|
value: 50.632999999999996 |
|
- type: mrr_at_3 |
|
value: 47.07 |
|
- type: mrr_at_5 |
|
value: 48.49 |
|
- type: ndcg_at_1 |
|
value: 40.935 |
|
- type: ndcg_at_10 |
|
value: 47.583999999999996 |
|
- type: ndcg_at_100 |
|
value: 54.69199999999999 |
|
- type: ndcg_at_1000 |
|
value: 56.314 |
|
- type: ndcg_at_3 |
|
value: 41.973 |
|
- type: ndcg_at_5 |
|
value: 44.334 |
|
- type: precision_at_1 |
|
value: 40.935 |
|
- type: precision_at_10 |
|
value: 10.585 |
|
- type: precision_at_100 |
|
value: 1.637 |
|
- type: precision_at_1000 |
|
value: 0.184 |
|
- type: precision_at_3 |
|
value: 23.881 |
|
- type: precision_at_5 |
|
value: 17.399 |
|
- type: recall_at_1 |
|
value: 26.931 |
|
- type: recall_at_10 |
|
value: 59.006 |
|
- type: recall_at_100 |
|
value: 88.247 |
|
- type: recall_at_1000 |
|
value: 99.045 |
|
- type: recall_at_3 |
|
value: 42.064 |
|
- type: recall_at_5 |
|
value: 49.266 |
|
- task: |
|
type: PairClassification |
|
dataset: |
|
type: C-MTEB/CMNLI |
|
name: MTEB Cmnli |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: cos_sim_accuracy |
|
value: 86.08538785327721 |
|
- type: cos_sim_ap |
|
value: 92.64373114205229 |
|
- type: cos_sim_f1 |
|
value: 86.89951395953432 |
|
- type: cos_sim_precision |
|
value: 84.11378555798687 |
|
- type: cos_sim_recall |
|
value: 89.87608136544307 |
|
- type: dot_accuracy |
|
value: 72.66386049308478 |
|
- type: dot_ap |
|
value: 81.053422935767 |
|
- type: dot_f1 |
|
value: 75.19933726830277 |
|
- type: dot_precision |
|
value: 67.4907063197026 |
|
- type: dot_recall |
|
value: 84.89595510872107 |
|
- type: euclidean_accuracy |
|
value: 85.52014431749849 |
|
- type: euclidean_ap |
|
value: 91.90647782899615 |
|
- type: euclidean_f1 |
|
value: 86.26361413647477 |
|
- type: euclidean_precision |
|
value: 82.2071595001059 |
|
- type: euclidean_recall |
|
value: 90.74117371989713 |
|
- type: manhattan_accuracy |
|
value: 85.48406494287433 |
|
- type: manhattan_ap |
|
value: 91.89657919524385 |
|
- type: manhattan_f1 |
|
value: 86.20413761572752 |
|
- type: manhattan_precision |
|
value: 84.324686940966 |
|
- type: manhattan_recall |
|
value: 88.16927753097966 |
|
- type: max_accuracy |
|
value: 86.08538785327721 |
|
- type: max_ap |
|
value: 92.64373114205229 |
|
- type: max_f1 |
|
value: 86.89951395953432 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/CovidRetrieval |
|
name: MTEB CovidRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 75.50099999999999 |
|
- type: map_at_10 |
|
value: 83.43 |
|
- type: map_at_100 |
|
value: 83.577 |
|
- type: map_at_1000 |
|
value: 83.57900000000001 |
|
- type: map_at_3 |
|
value: 82.06400000000001 |
|
- type: map_at_5 |
|
value: 82.88600000000001 |
|
- type: mrr_at_1 |
|
value: 75.869 |
|
- type: mrr_at_10 |
|
value: 83.536 |
|
- type: mrr_at_100 |
|
value: 83.682 |
|
- type: mrr_at_1000 |
|
value: 83.68299999999999 |
|
- type: mrr_at_3 |
|
value: 82.244 |
|
- type: mrr_at_5 |
|
value: 82.998 |
|
- type: ndcg_at_1 |
|
value: 75.764 |
|
- type: ndcg_at_10 |
|
value: 86.777 |
|
- type: ndcg_at_100 |
|
value: 87.36 |
|
- type: ndcg_at_1000 |
|
value: 87.424 |
|
- type: ndcg_at_3 |
|
value: 84.10300000000001 |
|
- type: ndcg_at_5 |
|
value: 85.532 |
|
- type: precision_at_1 |
|
value: 75.764 |
|
- type: precision_at_10 |
|
value: 9.8 |
|
- type: precision_at_100 |
|
value: 1.005 |
|
- type: precision_at_1000 |
|
value: 0.101 |
|
- type: precision_at_3 |
|
value: 30.207 |
|
- type: precision_at_5 |
|
value: 18.82 |
|
- type: recall_at_1 |
|
value: 75.50099999999999 |
|
- type: recall_at_10 |
|
value: 96.997 |
|
- type: recall_at_100 |
|
value: 99.473 |
|
- type: recall_at_1000 |
|
value: 100.0 |
|
- type: recall_at_3 |
|
value: 89.831 |
|
- type: recall_at_5 |
|
value: 93.256 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/DuRetrieval |
|
name: MTEB DuRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 27.094 |
|
- type: map_at_10 |
|
value: 82.418 |
|
- type: map_at_100 |
|
value: 85.05 |
|
- type: map_at_1000 |
|
value: 85.083 |
|
- type: map_at_3 |
|
value: 57.68600000000001 |
|
- type: map_at_5 |
|
value: 72.476 |
|
- type: mrr_at_1 |
|
value: 92.25 |
|
- type: mrr_at_10 |
|
value: 94.621 |
|
- type: mrr_at_100 |
|
value: 94.675 |
|
- type: mrr_at_1000 |
|
value: 94.677 |
|
- type: mrr_at_3 |
|
value: 94.375 |
|
- type: mrr_at_5 |
|
value: 94.52199999999999 |
|
- type: ndcg_at_1 |
|
value: 92.25 |
|
- type: ndcg_at_10 |
|
value: 89.13600000000001 |
|
- type: ndcg_at_100 |
|
value: 91.532 |
|
- type: ndcg_at_1000 |
|
value: 91.836 |
|
- type: ndcg_at_3 |
|
value: 88.50099999999999 |
|
- type: ndcg_at_5 |
|
value: 87.251 |
|
- type: precision_at_1 |
|
value: 92.25 |
|
- type: precision_at_10 |
|
value: 42.295 |
|
- type: precision_at_100 |
|
value: 4.812 |
|
- type: precision_at_1000 |
|
value: 0.48900000000000005 |
|
- type: precision_at_3 |
|
value: 79.167 |
|
- type: precision_at_5 |
|
value: 66.56 |
|
- type: recall_at_1 |
|
value: 27.094 |
|
- type: recall_at_10 |
|
value: 89.816 |
|
- type: recall_at_100 |
|
value: 97.855 |
|
- type: recall_at_1000 |
|
value: 99.384 |
|
- type: recall_at_3 |
|
value: 59.557 |
|
- type: recall_at_5 |
|
value: 76.395 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/EcomRetrieval |
|
name: MTEB EcomRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 53.6 |
|
- type: map_at_10 |
|
value: 62.985 |
|
- type: map_at_100 |
|
value: 63.532999999999994 |
|
- type: map_at_1000 |
|
value: 63.546 |
|
- type: map_at_3 |
|
value: 60.617 |
|
- type: map_at_5 |
|
value: 62.017 |
|
- type: mrr_at_1 |
|
value: 53.6 |
|
- type: mrr_at_10 |
|
value: 62.985 |
|
- type: mrr_at_100 |
|
value: 63.532999999999994 |
|
- type: mrr_at_1000 |
|
value: 63.546 |
|
- type: mrr_at_3 |
|
value: 60.617 |
|
- type: mrr_at_5 |
|
value: 62.017 |
|
- type: ndcg_at_1 |
|
value: 53.6 |
|
- type: ndcg_at_10 |
|
value: 67.755 |
|
- type: ndcg_at_100 |
|
value: 70.366 |
|
- type: ndcg_at_1000 |
|
value: 70.696 |
|
- type: ndcg_at_3 |
|
value: 62.89900000000001 |
|
- type: ndcg_at_5 |
|
value: 65.437 |
|
- type: precision_at_1 |
|
value: 53.6 |
|
- type: precision_at_10 |
|
value: 8.28 |
|
- type: precision_at_100 |
|
value: 0.9490000000000001 |
|
- type: precision_at_1000 |
|
value: 0.098 |
|
- type: precision_at_3 |
|
value: 23.166999999999998 |
|
- type: precision_at_5 |
|
value: 15.14 |
|
- type: recall_at_1 |
|
value: 53.6 |
|
- type: recall_at_10 |
|
value: 82.8 |
|
- type: recall_at_100 |
|
value: 94.89999999999999 |
|
- type: recall_at_1000 |
|
value: 97.5 |
|
- type: recall_at_3 |
|
value: 69.5 |
|
- type: recall_at_5 |
|
value: 75.7 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/IFlyTek-classification |
|
name: MTEB IFlyTek |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 52.104655636783384 |
|
- type: f1 |
|
value: 41.025743582860514 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/JDReview-classification |
|
name: MTEB JDReview |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 88.57410881801127 |
|
- type: ap |
|
value: 59.49612312498937 |
|
- type: f1 |
|
value: 83.70595013666741 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/LCQMC |
|
name: MTEB LCQMC |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 74.00327736048256 |
|
- type: cos_sim_spearman |
|
value: 79.5459672237356 |
|
- type: euclidean_pearson |
|
value: 79.18300205389669 |
|
- type: euclidean_spearman |
|
value: 79.21872988987533 |
|
- type: manhattan_pearson |
|
value: 79.1715470733081 |
|
- type: manhattan_spearman |
|
value: 79.20756273498812 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/MMarcoRetrieval |
|
name: MTEB MMarcoRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 66.94600000000001 |
|
- type: map_at_10 |
|
value: 75.947 |
|
- type: map_at_100 |
|
value: 76.268 |
|
- type: map_at_1000 |
|
value: 76.28 |
|
- type: map_at_3 |
|
value: 74.13300000000001 |
|
- type: map_at_5 |
|
value: 75.28399999999999 |
|
- type: mrr_at_1 |
|
value: 69.241 |
|
- type: mrr_at_10 |
|
value: 76.532 |
|
- type: mrr_at_100 |
|
value: 76.816 |
|
- type: mrr_at_1000 |
|
value: 76.827 |
|
- type: mrr_at_3 |
|
value: 74.95 |
|
- type: mrr_at_5 |
|
value: 75.957 |
|
- type: ndcg_at_1 |
|
value: 69.241 |
|
- type: ndcg_at_10 |
|
value: 79.54299999999999 |
|
- type: ndcg_at_100 |
|
value: 80.95 |
|
- type: ndcg_at_1000 |
|
value: 81.252 |
|
- type: ndcg_at_3 |
|
value: 76.119 |
|
- type: ndcg_at_5 |
|
value: 78.069 |
|
- type: precision_at_1 |
|
value: 69.241 |
|
- type: precision_at_10 |
|
value: 9.576 |
|
- type: precision_at_100 |
|
value: 1.026 |
|
- type: precision_at_1000 |
|
value: 0.105 |
|
- type: precision_at_3 |
|
value: 28.571999999999996 |
|
- type: precision_at_5 |
|
value: 18.181 |
|
- type: recall_at_1 |
|
value: 66.94600000000001 |
|
- type: recall_at_10 |
|
value: 90.024 |
|
- type: recall_at_100 |
|
value: 96.3 |
|
- type: recall_at_1000 |
|
value: 98.656 |
|
- type: recall_at_3 |
|
value: 81.026 |
|
- type: recall_at_5 |
|
value: 85.658 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: mteb/amazon_massive_intent |
|
name: MTEB MassiveIntentClassification (zh-CN) |
|
config: zh-CN |
|
split: test |
|
revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7 |
|
metrics: |
|
- type: accuracy |
|
value: 77.71015467383997 |
|
- type: f1 |
|
value: 74.32345894845358 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: mteb/amazon_massive_scenario |
|
name: MTEB MassiveScenarioClassification (zh-CN) |
|
config: zh-CN |
|
split: test |
|
revision: 7d571f92784cd94a019292a1f45445077d0ef634 |
|
metrics: |
|
- type: accuracy |
|
value: 85.63214525891055 |
|
- type: f1 |
|
value: 84.65303466003252 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/MedicalRetrieval |
|
name: MTEB MedicalRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 55.50000000000001 |
|
- type: map_at_10 |
|
value: 61.66199999999999 |
|
- type: map_at_100 |
|
value: 62.13999999999999 |
|
- type: map_at_1000 |
|
value: 62.187000000000005 |
|
- type: map_at_3 |
|
value: 59.967000000000006 |
|
- type: map_at_5 |
|
value: 60.927 |
|
- type: mrr_at_1 |
|
value: 55.7 |
|
- type: mrr_at_10 |
|
value: 61.76199999999999 |
|
- type: mrr_at_100 |
|
value: 62.241 |
|
- type: mrr_at_1000 |
|
value: 62.287000000000006 |
|
- type: mrr_at_3 |
|
value: 60.06700000000001 |
|
- type: mrr_at_5 |
|
value: 61.027 |
|
- type: ndcg_at_1 |
|
value: 55.50000000000001 |
|
- type: ndcg_at_10 |
|
value: 64.878 |
|
- type: ndcg_at_100 |
|
value: 67.464 |
|
- type: ndcg_at_1000 |
|
value: 68.745 |
|
- type: ndcg_at_3 |
|
value: 61.367000000000004 |
|
- type: ndcg_at_5 |
|
value: 63.117999999999995 |
|
- type: precision_at_1 |
|
value: 55.50000000000001 |
|
- type: precision_at_10 |
|
value: 7.51 |
|
- type: precision_at_100 |
|
value: 0.878 |
|
- type: precision_at_1000 |
|
value: 0.098 |
|
- type: precision_at_3 |
|
value: 21.8 |
|
- type: precision_at_5 |
|
value: 13.94 |
|
- type: recall_at_1 |
|
value: 55.50000000000001 |
|
- type: recall_at_10 |
|
value: 75.1 |
|
- type: recall_at_100 |
|
value: 87.8 |
|
- type: recall_at_1000 |
|
value: 97.89999999999999 |
|
- type: recall_at_3 |
|
value: 65.4 |
|
- type: recall_at_5 |
|
value: 69.69999999999999 |
|
- task: |
|
type: Reranking |
|
dataset: |
|
type: C-MTEB/Mmarco-reranking |
|
name: MTEB MMarcoReranking |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map |
|
value: 33.386980266936106 |
|
- type: mrr |
|
value: 32.11904761904762 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/MultilingualSentiment-classification |
|
name: MTEB MultilingualSentiment |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 79.08666666666666 |
|
- type: f1 |
|
value: 78.93142205976953 |
|
- task: |
|
type: PairClassification |
|
dataset: |
|
type: C-MTEB/OCNLI |
|
name: MTEB Ocnli |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: cos_sim_accuracy |
|
value: 84.35300487276665 |
|
- type: cos_sim_ap |
|
value: 87.83572265803564 |
|
- type: cos_sim_f1 |
|
value: 85.42713567839195 |
|
- type: cos_sim_precision |
|
value: 81.49568552253116 |
|
- type: cos_sim_recall |
|
value: 89.7571277719113 |
|
- type: dot_accuracy |
|
value: 72.87493232268544 |
|
- type: dot_ap |
|
value: 80.29032993894747 |
|
- type: dot_f1 |
|
value: 76.5938475256353 |
|
- type: dot_precision |
|
value: 66.28086419753086 |
|
- type: dot_recall |
|
value: 90.70749736008447 |
|
- type: euclidean_accuracy |
|
value: 82.34975636166757 |
|
- type: euclidean_ap |
|
value: 85.73873757468064 |
|
- type: euclidean_f1 |
|
value: 83.56713426853707 |
|
- type: euclidean_precision |
|
value: 79.50428979980934 |
|
- type: euclidean_recall |
|
value: 88.0675818373812 |
|
- type: manhattan_accuracy |
|
value: 82.45804006497022 |
|
- type: manhattan_ap |
|
value: 85.7176464290469 |
|
- type: manhattan_f1 |
|
value: 83.65095285857572 |
|
- type: manhattan_precision |
|
value: 79.65616045845272 |
|
- type: manhattan_recall |
|
value: 88.0675818373812 |
|
- type: max_accuracy |
|
value: 84.35300487276665 |
|
- type: max_ap |
|
value: 87.83572265803564 |
|
- type: max_f1 |
|
value: 85.42713567839195 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/OnlineShopping-classification |
|
name: MTEB OnlineShopping |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 94.61999999999999 |
|
- type: ap |
|
value: 92.74140430219491 |
|
- type: f1 |
|
value: 94.60775857122515 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/PAWSX |
|
name: MTEB PAWSX |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 39.75749234575995 |
|
- type: cos_sim_spearman |
|
value: 46.48035295363829 |
|
- type: euclidean_pearson |
|
value: 45.38711981599582 |
|
- type: euclidean_spearman |
|
value: 46.13915356562481 |
|
- type: manhattan_pearson |
|
value: 45.420770530489065 |
|
- type: manhattan_spearman |
|
value: 46.179913441143775 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/QBQTC |
|
name: MTEB QBQTC |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 44.02008249965321 |
|
- type: cos_sim_spearman |
|
value: 45.906917552219156 |
|
- type: euclidean_pearson |
|
value: 36.600317631983316 |
|
- type: euclidean_spearman |
|
value: 41.97740958824762 |
|
- type: manhattan_pearson |
|
value: 36.54329048509785 |
|
- type: manhattan_spearman |
|
value: 41.91222171040451 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: mteb/sts22-crosslingual-sts |
|
name: MTEB STS22 (zh) |
|
config: zh |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 60.97044608578288 |
|
- type: cos_sim_spearman |
|
value: 63.76187490245927 |
|
- type: euclidean_pearson |
|
value: 60.74245987426317 |
|
- type: euclidean_spearman |
|
value: 63.32990713078846 |
|
- type: manhattan_pearson |
|
value: 60.62422616577702 |
|
- type: manhattan_spearman |
|
value: 63.256612476686826 |
|
- task: |
|
type: STS |
|
dataset: |
|
type: C-MTEB/STSB |
|
name: MTEB STSB |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: cos_sim_pearson |
|
value: 76.28185867362305 |
|
- type: cos_sim_spearman |
|
value: 78.71478656159289 |
|
- type: euclidean_pearson |
|
value: 79.80734359535234 |
|
- type: euclidean_spearman |
|
value: 79.85403491297063 |
|
- type: manhattan_pearson |
|
value: 79.79454037962215 |
|
- type: manhattan_spearman |
|
value: 79.82796402623201 |
|
- task: |
|
type: Reranking |
|
dataset: |
|
type: C-MTEB/T2Reranking |
|
name: MTEB T2Reranking |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map |
|
value: 67.14759526113295 |
|
- type: mrr |
|
value: 77.36422096484723 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/T2Retrieval |
|
name: MTEB T2Retrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 28.177999999999997 |
|
- type: map_at_10 |
|
value: 78.77199999999999 |
|
- type: map_at_100 |
|
value: 82.365 |
|
- type: map_at_1000 |
|
value: 82.422 |
|
- type: map_at_3 |
|
value: 55.452999999999996 |
|
- type: map_at_5 |
|
value: 68.12700000000001 |
|
- type: mrr_at_1 |
|
value: 91.097 |
|
- type: mrr_at_10 |
|
value: 93.52000000000001 |
|
- type: mrr_at_100 |
|
value: 93.587 |
|
- type: mrr_at_1000 |
|
value: 93.589 |
|
- type: mrr_at_3 |
|
value: 93.136 |
|
- type: mrr_at_5 |
|
value: 93.381 |
|
- type: ndcg_at_1 |
|
value: 91.097 |
|
- type: ndcg_at_10 |
|
value: 86.136 |
|
- type: ndcg_at_100 |
|
value: 89.515 |
|
- type: ndcg_at_1000 |
|
value: 90.049 |
|
- type: ndcg_at_3 |
|
value: 87.41600000000001 |
|
- type: ndcg_at_5 |
|
value: 86.115 |
|
- type: precision_at_1 |
|
value: 91.097 |
|
- type: precision_at_10 |
|
value: 42.597 |
|
- type: precision_at_100 |
|
value: 5.043 |
|
- type: precision_at_1000 |
|
value: 0.517 |
|
- type: precision_at_3 |
|
value: 76.239 |
|
- type: precision_at_5 |
|
value: 63.93 |
|
- type: recall_at_1 |
|
value: 28.177999999999997 |
|
- type: recall_at_10 |
|
value: 85.182 |
|
- type: recall_at_100 |
|
value: 96.174 |
|
- type: recall_at_1000 |
|
value: 98.848 |
|
- type: recall_at_3 |
|
value: 57.150999999999996 |
|
- type: recall_at_5 |
|
value: 71.50999999999999 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/TNews-classification |
|
name: MTEB TNews |
|
config: default |
|
split: validation |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 54.521 |
|
- type: f1 |
|
value: 52.53528052282081 |
|
- task: |
|
type: Clustering |
|
dataset: |
|
type: C-MTEB/ThuNewsClusteringP2P |
|
name: MTEB ThuNewsClusteringP2P |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: v_measure |
|
value: 74.2003249023509 |
|
- task: |
|
type: Clustering |
|
dataset: |
|
type: C-MTEB/ThuNewsClusteringS2S |
|
name: MTEB ThuNewsClusteringS2S |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: v_measure |
|
value: 68.4277378629746 |
|
- task: |
|
type: Retrieval |
|
dataset: |
|
type: C-MTEB/VideoRetrieval |
|
name: MTEB VideoRetrieval |
|
config: default |
|
split: dev |
|
revision: None |
|
metrics: |
|
- type: map_at_1 |
|
value: 58.599999999999994 |
|
- type: map_at_10 |
|
value: 68.671 |
|
- type: map_at_100 |
|
value: 69.148 |
|
- type: map_at_1000 |
|
value: 69.157 |
|
- type: map_at_3 |
|
value: 66.9 |
|
- type: map_at_5 |
|
value: 68.045 |
|
- type: mrr_at_1 |
|
value: 58.599999999999994 |
|
- type: mrr_at_10 |
|
value: 68.671 |
|
- type: mrr_at_100 |
|
value: 69.148 |
|
- type: mrr_at_1000 |
|
value: 69.157 |
|
- type: mrr_at_3 |
|
value: 66.9 |
|
- type: mrr_at_5 |
|
value: 68.045 |
|
- type: ndcg_at_1 |
|
value: 58.599999999999994 |
|
- type: ndcg_at_10 |
|
value: 73.099 |
|
- type: ndcg_at_100 |
|
value: 75.33 |
|
- type: ndcg_at_1000 |
|
value: 75.58500000000001 |
|
- type: ndcg_at_3 |
|
value: 69.502 |
|
- type: ndcg_at_5 |
|
value: 71.542 |
|
- type: precision_at_1 |
|
value: 58.599999999999994 |
|
- type: precision_at_10 |
|
value: 8.68 |
|
- type: precision_at_100 |
|
value: 0.97 |
|
- type: precision_at_1000 |
|
value: 0.099 |
|
- type: precision_at_3 |
|
value: 25.667 |
|
- type: precision_at_5 |
|
value: 16.38 |
|
- type: recall_at_1 |
|
value: 58.599999999999994 |
|
- type: recall_at_10 |
|
value: 86.8 |
|
- type: recall_at_100 |
|
value: 97.0 |
|
- type: recall_at_1000 |
|
value: 99.1 |
|
- type: recall_at_3 |
|
value: 77.0 |
|
- type: recall_at_5 |
|
value: 81.89999999999999 |
|
- task: |
|
type: Classification |
|
dataset: |
|
type: C-MTEB/waimai-classification |
|
name: MTEB Waimai |
|
config: default |
|
split: test |
|
revision: None |
|
metrics: |
|
- type: accuracy |
|
value: 89.58999999999999 |
|
- type: ap |
|
value: 75.69899834265364 |
|
- type: f1 |
|
value: 88.2026184757175 |
|
--- |
|
[EN](README.md) | [简体中文](README_zh.md)

**News**

**[2024-05-16]**

Due to internal company considerations, we have temporarily removed the model weights. They will be uploaded again after passing our internal review process.

In the meantime, please access this model via the API: https://platform.sensenova.cn/doc?path=/chat/Embeddings/Embeddings.md

The API on that page currently has a temporary problem. Until it is resolved, please access the model in the following way:

```python
import requests

# Temporary embedding endpoint
url = "http://103.237.28.72:8006/v1/qd"
headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json'
}
# Texts to embed are passed as a list under "inputs"
data = {
    "inputs": ['hello,world']
}

response = requests.post(url, json=data, headers=headers)
print(response.json())
```

**[2024-05-14]**

We have released our model weights, training code, and tech report. Discussions are welcome.

For the training code, please refer to our [GitHub repository](https://github.com/hjq133/piccolo-embedding).

For training details, please refer to our [tech report](https://arxiv.org/abs/2405.06932).

**[2024-04-22]**

piccolo-large-zh-v2 currently ranks first on the C-MTEB leaderboard, ahead of the previous BERT-based model by about 1.9 points.

## Piccolo-large-zh-v2

piccolo-large-zh-v2 is a Chinese embedding model developed by the general model group at SenseTime Research. This upgraded version of Piccolo prioritizes general downstream fine-tuning. Piccolo2 primarily relies on an efficient multi-task hybrid loss training approach that makes effective use of textual data and labels from diverse downstream tasks. In addition, Piccolo2 scales up the embedding dimension and uses MRL (Matryoshka Representation Learning) training to support more flexible vector dimensions.

## 💡 Model Highlights

The main feature of piccolo2 is that it uses a multi-task hybrid loss during training.

For retrieval/reranking tasks, we use the standard InfoNCE loss with in-batch negatives:

<p align='left'>
<img src='assets/1.png' width='400' height='80'>
</p>
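
As a rough illustration of the in-batch-negative InfoNCE objective described above (the exact formula is the one in the figure and the technical report), here is a minimal NumPy sketch. The function name `info_nce_in_batch` and the temperature of 0.05 are assumptions made for illustration; they are not taken from the Piccolo2 training code, and hard negatives are not modeled.

```python
import numpy as np

def info_nce_in_batch(query_emb, pos_emb, temperature=0.05):
    """InfoNCE where row i of pos_emb is the positive for query i
    and every other row in the batch serves as a negative.
    The temperature default is an assumed value."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                      # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives sit on the diagonal

# Toy batch: orthogonal unit vectors, first aligned with their positives, then misaligned
queries = np.eye(4)
loss_aligned = info_nce_in_batch(queries, queries)
loss_shuffled = info_nce_in_batch(queries, np.roll(queries, 1, axis=0))
```

The loss is close to zero when each query is most similar to its own positive and grows when another item in the batch is more similar.
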

For STS/pair classification tasks, we use the CoSENT loss, which has been shown to work better for data with more fine-grained labels (e.g. score values):

<p align='left'>
<img src='assets/2.png' width='450' height='90'>
</p>
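
The ranking idea behind CoSENT can be sketched as follows, assuming the commonly used formulation: for every two text pairs where the first has the higher label, penalize the case where the second has the higher cosine similarity. The function name `cosent_loss` and the scale factor of 20 are illustrative assumptions, not values confirmed by this model card.

```python
import numpy as np

def cosent_loss(cos_sim, labels, scale=20.0):
    """log(1 + sum over (i, j) with labels[i] > labels[j] of
    exp(scale * (cos_sim[j] - cos_sim[i]))). The scale default is an assumed value."""
    cos_sim = np.asarray(cos_sim, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # diff[i, j] = scale * (cos_sim[j] - cos_sim[i])
    diff = scale * (cos_sim[None, :] - cos_sim[:, None])
    # Keep only ordered pairs where pair i is labelled as more similar than pair j
    mask = labels[:, None] > labels[None, :]
    terms = np.concatenate([[0.0], diff[mask]])  # the leading 0 is the "1 +" inside the log
    m = terms.max()
    return float(m + np.log(np.exp(terms - m).sum()))  # stable log-sum-exp

# Three text pairs with graded similarity labels
labels = [1.0, 0.5, 0.0]
loss_good = cosent_loss([0.9, 0.5, 0.1], labels)  # cosine order matches label order
loss_bad = cosent_loss([0.1, 0.5, 0.9], labels)   # cosine order is reversed
```

Because only the relative order of the labels matters, the same code works for binary labels and for graded scores.
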

For classification/clustering tasks, we treat each text and its semantic labels as positive and negative pairs, which converts the dataset into triplets, and then optimize with InfoNCE. However, it is important to stress that in-batch negatives are no longer used here, because they can easily lead to conflicting training targets:

<p align='left'>
<img src='assets/3.png' width='400' height='80'>
</p>
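
The triplet conversion can be pictured with the sketch below. It assumes that the positive is the text of the sample's own label and the negatives are the texts of all other labels; the helper names `build_triplets` and `info_nce_explicit_negatives`, and the temperature value, are hypothetical. Scoring each anchor only against its own negatives avoids the case where two samples in one batch share a label and would otherwise be pushed apart.

```python
import numpy as np

def build_triplets(texts, labels, label_names):
    """Turn (text, label) samples into (text, positive label, negative labels) triplets."""
    triplets = []
    for text, label in zip(texts, labels):
        negatives = [name for name in label_names if name != label]
        triplets.append((text, label, negatives))
    return triplets

def info_nce_explicit_negatives(anchor, positive, negatives, temperature=0.05):
    """InfoNCE for one anchor, scored only against its own positive and negatives
    (no other items from the batch are used). The temperature is an assumed value."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = unit(anchor), unit(positive), unit(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / temperature  # positive sits at index 0
    logits = logits - logits.max()                            # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

triplets = build_triplets(["很好用"], ["positive"], ["positive", "negative", "neutral"])

# Toy 2-d vectors stand in for the embeddings of the text and the label names
anchor = np.array([1.0, 0.0])
loss_match = info_nce_explicit_negatives(anchor, np.array([1.0, 0.0]),
                                         np.array([[0.0, 1.0], [-1.0, 0.0]]))
loss_mismatch = info_nce_explicit_negatives(anchor, np.array([0.0, 1.0]),
                                            np.array([[1.0, 0.0]]))
```
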

## 📃 Experiments and Results

Piccolo2 primarily focuses on the general downstream fine-tuning paradigm. Our open-source model uses [stella-v3.5](https://huggingface.co/infgrad/stella-mrl-large-zh-v3.5-1792d) as initialization and was trained for about 2500 steps on 32 GPUs. For more implementation details, please refer to our [technical report](https://arxiv.org/abs/2405.06932).

| Model Name | Model Size (GB) | Dimension | Sequence Length | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) | Average (35) |
|:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [**piccolo-large-zh-v2**](https://huggingface.co/sensenova/piccolo-large-zh-v2) | 1.21 | 1792 | 512 | 74.59 | 62.17 | 90.24 | 70 | 74.36 | 63.5 | 70.95 |
| [gte-Qwen1.5-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct) | 26.45 | 4096 | 32768 | 73.35 | 67.08 | 88.52 | 66.38 | 70.62 | 62.32 | 69.56 |
| [acge-text-embedding](https://huggingface.co/aspire/acge_text_embedding) | 1.21 | 1792 | 512 | 72.75 | 58.7 | 87.84 | 67.98 | 72.93 | 62.09 | 69.07 |

## 🔨 Usage

The piccolo model can be easily used with the sentence-transformers package:

```python
# For s2s/s2p datasets, you can use piccolo as below
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]  # "data 1", "data 2"
matryoshka_dim = 1792  # supported: 256, 512, 768, 1024, 1280, 1536, 1792

model = SentenceTransformer('sensenova/piccolo-large-zh-v2')

# Encode without normalization so that truncation happens before normalization
embeddings_1 = model.encode(sentences, normalize_embeddings=False)
embeddings_2 = model.encode(sentences, normalize_embeddings=False)

# Keep the first matryoshka_dim components, then L2-normalize each row
embeddings_1 = normalize(embeddings_1[..., :matryoshka_dim], norm="l2", axis=1)
embeddings_2 = normalize(embeddings_2[..., :matryoshka_dim], norm="l2", axis=1)

# Cosine similarity matrix between the two sets of sentences
similarity = embeddings_1 @ embeddings_2.T
```
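
To see the truncate-then-normalize step in isolation, without downloading the model, the sketch below applies it to random vectors. The helper `truncate_and_normalize` is a hypothetical name, and the random array only stands in for the output of `model.encode(...)`; the list of supported dimensions is taken from the comment in the snippet above.

```python
import numpy as np

SUPPORTED_DIMS = (256, 512, 768, 1024, 1280, 1536, 1792)

def truncate_and_normalize(embeddings, dim):
    """Keep the first `dim` components of each embedding and rescale rows to unit L2 norm."""
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"dim must be one of {SUPPORTED_DIMS}, got {dim}")
    truncated = np.asarray(embeddings, dtype=float)[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

# Random vectors stand in for un-normalized 1792-d model output
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(2, 1792))

small = truncate_and_normalize(fake_embeddings, 256)
similarity = small @ small.T  # cosine similarities in the 256-d subspace
```

Normalizing after truncation matters: a prefix of a unit vector is generally not a unit vector, so the dot product would no longer be a cosine similarity.
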

## 🤗 **Model List**

| Model | Language | Description | Prompt |
|:-|:-:|:-:|:--:|
| [sensenova/piccolo-large-zh-v2](https://huggingface.co/sensenova/piccolo-large-zh-v2) | Chinese | version 2: fine-tuned with multi-task hybrid loss training | None |
| [sensenova/piccolo-large-zh](https://huggingface.co/sensenova/piccolo-large-zh) | Chinese | version 1: pretrained on 400 million Chinese text pairs | '查询'/'结果' |
| [sensenova/piccolo-base-zh](https://huggingface.co/sensenova/piccolo-base-zh) | Chinese | version 1: pretrained on 400 million Chinese text pairs | '查询'/'结果' |

## Citation

If you find our tech report, models, or code helpful, please cite our report or give us a star on GitHub or Hugging Face!

```bibtex
@misc{2405.06932,
  Author = {Junqin Huang and Zhongjie Hu and Zihao Jing and Mengya Gao and Yichao Wu},
  Title = {Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training},
  Year = {2024},
  Eprint = {arXiv:2405.06932},
}
```