Update README.md
README.md
CHANGED
@@ -1063,14 +1063,16 @@ piccolo是一个通用embedding模型, 由来自商汤科技的通用模型组
 目前,我们提供了piccolo-base-zh和piccolo-large-zh两个模型。
 
 piccolo is a general text embedding model, powered by the General Model Group from SenseTime Research.
-
+Inspired by E5 and GTE, piccolo is trained using a two-stage pipeline. In the first stage, we collect and crawl 400 million weakly supervised Chinese text pairs from the Internet,
 and train the model with the pair (text, text_pos) softmax contrastive loss.
 In the second stage, we collect 20 million human-labeled Chinese text pairs from the open-source dataset, and fine-tune the model with the triplet (text, text_pos, text_neg) contrastive loss.
 Currently we offer two model sizes: piccolo-base-zh and piccolo-large-zh.
 
 ## Metric
 我们将piccolo与其他的开源embedding模型在CMTEB榜单上进行了比较,请参考CMTEB榜单。我们在eval文件夹中提供了复现结果的脚本。
-
+
+We compared the performance of piccolo with other embedding models on the C-MTEB benchmark. Please refer to the C-MTEB leaderboard.
+We provide scripts in the "eval" folder for reproducing the results.
 
 
 | Model Name | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
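
Note on the two losses named in the README text above: the README does not give their exact form, so the following is only a minimal sketch of one common reading, not the authors' implementation. It assumes cosine similarity, in-batch negatives, and a temperature of 0.05. The function names (`pair_softmax_loss`, `triplet_softmax_loss`) are hypothetical. Embeddings are plain Python lists so the example runs without extra dependencies.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(scores, index):
    """Numerically stable log-softmax of `scores`, evaluated at `index`."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - log_z


def pair_softmax_loss(texts, positives, temperature=0.05):
    """Stage-1 style loss (assumed form): each text should score its own
    positive higher than the positives of the other texts in the batch."""
    losses = []
    for i, query in enumerate(texts):
        scores = [cosine(query, p) / temperature for p in positives]
        losses.append(-log_softmax_at(scores, i))
    return sum(losses) / len(losses)


def triplet_softmax_loss(texts, positives, negatives, temperature=0.05):
    """Stage-2 style loss (assumed form): same as above, but the explicit
    text_neg embeddings are added to the pool of candidates to beat."""
    losses = []
    for i, query in enumerate(texts):
        scores = [cosine(query, p) / temperature for p in positives]
        scores += [cosine(query, n) / temperature for n in negatives]
        losses.append(-log_softmax_at(scores, i))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    texts = [[1.0, 0.0], [0.0, 1.0]]
    positives = [[1.0, 0.0], [0.0, 1.0]]   # aligned with texts
    negatives = [[0.0, 1.0], [1.0, 0.0]]   # deliberately mismatched
    print(pair_softmax_loss(texts, positives))
    print(triplet_softmax_loss(texts, positives, negatives))
```

Under these assumptions, adding the negatives can only enlarge the softmax denominator, so the triplet loss is never smaller than the pair loss on the same batch. This matches the intuition that the second stage poses a harder discrimination task.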