MSMARCO numbers
Hi,
Thank you for sharing these very interesting results! I was wondering if you could share more details about your MSMARCO numbers? They seem to be really good, in the 70s. Have you done any analysis of where the gains come from? Also, it looks like some of the metrics, such as NDCG and MRR, are very high, while others, such as MAP, are lower than expected. Do you have a sense of why this is the case? Thank you so much for sharing your inspiring work!
Hi,
This is a really excellent and thought-provoking discovery! I have pinned this discussion.
First, I can reproduce this result with a vector dimension of 1024. (The result on MTEB uses 12288 dimensions; to save time I used 1024, since 1024d and 12288d give almost the same result.)
Below is the JSON result after running the MTEB scripts:
{
"dataset_revision": "c5a29a104738b98a9e76336939199e264163d4a0",
"evaluation_time": 6115.99347949028,
"kg_co2_emissions": null,
"mteb_version": "1.20.4",
"scores": {
"test": [
{
"hf_subset": "default",
"languages": [
"eng-Latn"
],
"main_score": 0.72314,
"map_at_1": 0.01742,
"map_at_10": 0.15734,
"map_at_100": 0.44579,
"map_at_1000": 0.5508,
"map_at_20": 0.25596,
"map_at_3": 0.05367,
"map_at_5": 0.09118,
"mrr_at_1": 0.9302325581395349,
"mrr_at_10": 0.9542635658914729,
"mrr_at_100": 0.9542635658914729,
"mrr_at_1000": 0.9542635658914729,
"mrr_at_20": 0.9542635658914729,
"mrr_at_3": 0.9496124031007752,
"mrr_at_5": 0.9542635658914729,
"nauc_map_at_1000_diff1": -0.46382975131440035,
"nauc_map_at_1000_max": 0.20438068926441116,
"nauc_map_at_1000_std": 0.4964733718614432,
"nauc_map_at_100_diff1": -0.23864316908809094,
"nauc_map_at_100_max": 0.09385294900690073,
"nauc_map_at_100_std": 0.15182362008938197,
"nauc_map_at_10_diff1": 0.12528818116788473,
"nauc_map_at_10_max": 0.007135244539555705,
"nauc_map_at_10_std": -0.21814958414226776,
"nauc_map_at_1_diff1": 0.17574364864826267,
"nauc_map_at_1_max": 0.031089573378313404,
"nauc_map_at_1_std": -0.18447533398842833,
"nauc_map_at_20_diff1": -0.011292227173401713,
"nauc_map_at_20_max": 0.052264325684664284,
"nauc_map_at_20_std": -0.12862222354835312,
"nauc_map_at_3_diff1": 0.19307799282362395,
"nauc_map_at_3_max": -0.018346889874611697,
"nauc_map_at_3_std": -0.2306024887686145,
"nauc_map_at_5_diff1": 0.24180343517914618,
"nauc_map_at_5_max": -0.04197180639575474,
"nauc_map_at_5_std": -0.27008599642691794,
"nauc_mrr_at_1000_diff1": -0.22421870351024387,
"nauc_mrr_at_1000_max": 0.35435194094510203,
"nauc_mrr_at_1000_std": 0.9055924233333185,
"nauc_mrr_at_100_diff1": -0.22421870351024387,
"nauc_mrr_at_100_max": 0.35435194094510203,
"nauc_mrr_at_100_std": 0.9055924233333185,
"nauc_mrr_at_10_diff1": -0.22421870351024387,
"nauc_mrr_at_10_max": 0.35435194094510203,
"nauc_mrr_at_10_std": 0.9055924233333185,
"nauc_mrr_at_1_diff1": -0.2768112574340929,
"nauc_mrr_at_1_max": 0.3278519250837586,
"nauc_mrr_at_1_std": 0.90716588294443,
"nauc_mrr_at_20_diff1": -0.22421870351024387,
"nauc_mrr_at_20_max": 0.35435194094510203,
"nauc_mrr_at_20_std": 0.9055924233333185,
"nauc_mrr_at_3_diff1": -0.25607369296482785,
"nauc_mrr_at_3_max": 0.4139502233194001,
"nauc_mrr_at_3_std": 0.9143069688717791,
"nauc_mrr_at_5_diff1": -0.22421870351024387,
"nauc_mrr_at_5_max": 0.35435194094510203,
"nauc_mrr_at_5_std": 0.9055924233333185,
"nauc_ndcg_at_1000_diff1": -0.6005450468612303,
"nauc_ndcg_at_1000_max": 0.4023700863831518,
"nauc_ndcg_at_1000_std": 0.5824218812857368,
"nauc_ndcg_at_100_diff1": -0.3528509721052785,
"nauc_ndcg_at_100_max": 0.2642667625048037,
"nauc_ndcg_at_100_std": 0.46975787771062294,
"nauc_ndcg_at_10_diff1": -0.29050921109285754,
"nauc_ndcg_at_10_max": 0.17960600600420823,
"nauc_ndcg_at_10_std": 0.30603978866864473,
"nauc_ndcg_at_1_diff1": 0.09027283415274971,
"nauc_ndcg_at_1_max": 0.030625180613426704,
"nauc_ndcg_at_1_std": 0.042087695534755575,
"nauc_ndcg_at_20_diff1": -0.3887981422052685,
"nauc_ndcg_at_20_max": 0.20863639931858355,
"nauc_ndcg_at_20_std": 0.4032151965596434,
"nauc_ndcg_at_3_diff1": -0.15287877156536792,
"nauc_ndcg_at_3_max": 0.13673715469761652,
"nauc_ndcg_at_3_std": 0.21340771876085632,
"nauc_ndcg_at_5_diff1": -0.18407515342968886,
"nauc_ndcg_at_5_max": 0.11257329740715624,
"nauc_ndcg_at_5_std": 0.20672689814830397,
"nauc_precision_at_1000_diff1": -0.3893691517124962,
"nauc_precision_at_1000_max": 0.10841773666253955,
"nauc_precision_at_1000_std": 0.5783899080310474,
"nauc_precision_at_100_diff1": -0.4168795550991344,
"nauc_precision_at_100_max": 0.13002778112156696,
"nauc_precision_at_100_std": 0.5588993748274989,
"nauc_precision_at_10_diff1": -0.6753350299914685,
"nauc_precision_at_10_max": 0.24844590101086228,
"nauc_precision_at_10_std": 0.6075804599231396,
"nauc_precision_at_1_diff1": -0.2768112574340929,
"nauc_precision_at_1_max": 0.3278519250837586,
"nauc_precision_at_1_std": 0.90716588294443,
"nauc_precision_at_20_diff1": -0.6666884921852013,
"nauc_precision_at_20_max": 0.20992300646288067,
"nauc_precision_at_20_std": 0.6158858644846663,
"nauc_precision_at_3_diff1": -0.7522733630279954,
"nauc_precision_at_3_max": 0.10468644981716246,
"nauc_precision_at_3_std": 0.7630768529734518,
"nauc_precision_at_5_diff1": -0.632143158865704,
"nauc_precision_at_5_max": 0.12457656992876864,
"nauc_precision_at_5_std": 0.6339430282104572,
"nauc_recall_at_1000_diff1": -0.7580647139161585,
"nauc_recall_at_1000_max": 0.49445982651300185,
"nauc_recall_at_1000_std": 0.5646676647070854,
"nauc_recall_at_100_diff1": -0.124525957761531,
"nauc_recall_at_100_max": 0.10743460524407461,
"nauc_recall_at_100_std": 0.02245542586035658,
"nauc_recall_at_10_diff1": 0.18864451555135933,
"nauc_recall_at_10_max": -0.027409424878224002,
"nauc_recall_at_10_std": -0.2912943612875681,
"nauc_recall_at_1_diff1": 0.17574364864826267,
"nauc_recall_at_1_max": 0.031089573378313404,
"nauc_recall_at_1_std": -0.18447533398842833,
"nauc_recall_at_20_diff1": 0.057125191089183236,
"nauc_recall_at_20_max": 0.025804432942512313,
"nauc_recall_at_20_std": -0.20636263042612146,
"nauc_recall_at_3_diff1": 0.23459649268734029,
"nauc_recall_at_3_max": -0.04524567876879379,
"nauc_recall_at_3_std": -0.2793965971903096,
"nauc_recall_at_5_diff1": 0.316326752184651,
"nauc_recall_at_5_max": -0.09275960366062967,
"nauc_recall_at_5_std": -0.3536464511098924,
"ndcg_at_1": 0.74031,
"ndcg_at_10": 0.72314,
"ndcg_at_100": 0.67733,
"ndcg_at_1000": 0.76798,
"ndcg_at_20": 0.71012,
"ndcg_at_3": 0.74259,
"ndcg_at_5": 0.74139,
"precision_at_1": 0.93023,
"precision_at_10": 0.8186,
"precision_at_100": 0.41279,
"precision_at_1000": 0.08023,
"precision_at_20": 0.75,
"precision_at_3": 0.89147,
"precision_at_5": 0.87907,
"recall_at_1": 0.01742,
"recall_at_10": 0.17388,
"recall_at_100": 0.56114,
"recall_at_1000": 0.85412,
"recall_at_20": 0.28763,
"recall_at_3": 0.0572,
"recall_at_5": 0.10274
}
]
},
"task_name": "MSMARCO"
}
> I was wondering if you could share more details about your MSMARCO numbers?
I am not sure what you mean by "MSMARCO numbers".
Do you mean the size of the test set? It can be found at https://huggingface.co/datasets/mteb/msmarco .
The MTEB test script does not report the size, but I estimated the amount of test data from the test speed and the test time, and the result is consistent with the official test set.
I have carefully studied the formulae for these metrics and found that the scores of the metrics driven by the top-ranked results (NDCG, MRR) are high, while the scores of the metrics that depend on how many of the relevant documents are retrieved (Recall, MAP) are low.
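A toy calculation may help show why the two groups of metrics can diverge. The sketch below assumes a single query with 50 relevant documents (a made-up number, chosen because the `recall_at_1` and `precision_at_1` values above suggest many relevant documents per query) and uses the common definitions in which Recall@k and AP@k are divided by the total number of relevant documents. The function names are illustrative and are not MTEB's API.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def average_precision_at_k(ranked, relevant, k):
    """Sum of precision at each relevant rank <= k, divided by the number of relevant docs."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical query: 50 relevant documents, and a perfect top 10.
relevant = {f"d{i}" for i in range(50)}
ranked = [f"d{i}" for i in range(10)]

rr = reciprocal_rank(ranked, relevant)
p10 = precision_at_k(ranked, relevant, 10)
r10 = recall_at_k(ranked, relevant, 10)
ap10 = average_precision_at_k(ranked, relevant, 10)
print(rr, p10, r10, ap10)
```

Even with a perfect top 10, Recall@10 and AP@10 cannot exceed 10/50 in this setup, while the reciprocal rank and Precision@10 are both 1. So a gap between these metric groups at small k is expected whenever queries have many relevant documents.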
I checked the results of Jasper's base model (stella_en_1.5B_v5) and did not find this phenomenon.
The difference between Jasper and Stella is that the Jasper model uses a new rank loss that I created. The core idea of this rank loss is to let the student model learn from the triplet data generated by the teacher model. The loss function is a margin loss with a margin of 0.015.
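As a rough illustration of this kind of loss, here is a minimal sketch in plain Python. It is one reading of the description above and not the actual Jasper training code: the way triplets are derived from teacher similarities, the use of cosine similarity, and all function names are assumptions. Only the margin value of 0.015 comes from this post.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def teacher_triplets(teacher_query, teacher_docs):
    """Assumed triplet mining: (pos, neg) index pairs where the teacher scores pos above neg."""
    sims = [cosine(teacher_query, d) for d in teacher_docs]
    n = len(sims)
    return [(i, j) for i in range(n) for j in range(n) if sims[i] > sims[j]]


def margin_rank_loss(student_query, student_docs, triplets, margin=0.015):
    """Mean hinge loss: the student should score pos above neg by at least `margin`."""
    sims = [cosine(student_query, d) for d in student_docs]
    losses = [max(0.0, margin - (sims[p] - sims[n])) for p, n in triplets]
    return sum(losses) / len(losses) if losses else 0.0


# Toy data: the teacher prefers document 0 over document 1.
triplets = teacher_triplets([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])

# A student that agrees with the teacher incurs no loss ...
loss_agree = margin_rank_loss([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], triplets)
# ... while a student with the opposite ordering is penalized.
loss_disagree = margin_rank_loss([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], triplets)
print(triplets, loss_agree, loss_disagree)
```

A loss of this shape only constrains the relative order of pairs, which is why it is natural to ask how it interacts with order-sensitive metrics.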
I suspect that this loss could be the cause.
I also found that the model abhinand/MedEmbed-small-v0.1 shows the same phenomenon. If you are interested, you can look into the details of their implementation.
This interesting phenomenon indicates that:
- The MTEB metrics may need to be more robust.
- Different training methods may give different results on different ranking metrics, which might be worth looking into.
Finally, I can't accurately explain this phenomenon, so I hope the results I've provided so far are somewhat enlightening to you.
Could it be a problem with the dataset split? MSMARCO uses the dev split instead of the test split for evaluation.
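Maybe. We need a test on the MSMARCO dev split. But I have no GPUs available right now... You can try it if you are interested, though I think MSMARCO is a classic dataset that has been used many, many times, so most likely there is nothing wrong with it.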
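Actually, I have run into this problem before: scores in the 60s on the test split and in the 40s on dev. I don't really have GPUs available right now either 🤦🏻♀️
@Kaguya-19 OK, I see, thanks. I just had a look at abhinand/MedEmbed-small-v0.1, and this model has the same problem. I read their report: they train on triplet data, so I guess it is also related to this kind of loss.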
Hi, everyone,
As verified by a friend, the version of the mteb library I used for testing has a small bug that causes the MSMARCO test split to be used as the evaluation data, which is why my NDCG score was so high.
I have updated the MTEB results, and they look normal now.
The PR to fix this bug: https://github.com/embeddings-benchmark/mteb/issues/1608
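For anyone who wants to double-check their own runs, a quick sanity check is to look at which split a result file was actually scored on. The sketch below assumes result files shaped like the JSON earlier in this thread, where the keys under `"scores"` are split names. The helper names and the trimmed sample are illustrative only.

```python
import json


def scored_splits(result):
    """Return the split names found under the "scores" key of a result dict."""
    return sorted(result.get("scores", {}).keys())


def uses_split(result, expected_split):
    """True if the result file contains scores for the expected split."""
    return expected_split in scored_splits(result)


# Trimmed-down version of the result shown earlier in this thread.
sample = json.loads("""
{
  "task_name": "MSMARCO",
  "mteb_version": "1.20.4",
  "scores": {"test": [{"hf_subset": "default", "main_score": 0.72314}]}
}
""")

splits = scored_splits(sample)
on_dev = uses_split(sample, "dev")
print(splits, on_dev)
```

For the result posted above this reports only the test split, which matches the diagnosis that the run did not use the dev split.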