Training Data and Distillation

#44
by kardosdrur - opened

Hi. I'm Márton Kardos, maintainer of MTEB.
I am writing because our recent efforts have focused on reliably indicating to our users whether a model has been trained in-domain, or whether its scores on our benchmarks can be taken as an accurate indication of its generalization performance.
From your technical report we learned that Jasper was distilled from models that were trained on multiple MTEB datasets, and we have been able to annotate this in our model metadata.
Your report, however, does not indicate whether the Stella models were trained on MTEB datasets or were finetuned/distilled from models that were.
The Stella models deliver performance very similar to models that were finetuned on MTEB tasks, so it seems reasonable to assume that the same is true of Stella.
As a fellow scholar, I assume that you have as strong a dedication to open science as I do, and it is in this spirit that I ask you to disclose these details to us and to our and your users.
Thanks in advance, Márton

@infgrad

NovaSearch org
edited 3 days ago

@kardosdrur
Hi there
1) Jasper is distilled from stella_en_1.5B_v5 and nvidia/NV-Embed-v2.
2) stella_en_1.5B_v5 is distilled from gte-Qwen2-7B-instruct and nvidia/NV-Embed-v1.
3) When training (i.e. distilling) the Jasper and Stella models, we only use unsupervised texts; however, these training texts may slightly overlap with MTEB sentences.

So I think the two models' zero-shot score (or ratio?) should be consistent with nvidia/NV-Embed-v1, nvidia/NV-Embed-v2, and gte-Qwen2-7B-instruct.
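
For intuition, here is a minimal sketch of what this kind of embedding distillation on unsupervised texts looks like. It is illustrative only, not our actual training code: the student/teacher names below are stand-ins with matching embedding sizes, it uses a single teacher, and it omits the projection layers a multi-teacher setup with different embedding dimensions would need.

```python
# Illustrative sketch only -- NOT the actual Jasper/Stella training code.
# Idea: embed the same unlabeled texts with a frozen teacher and a trainable
# student, then pull the student's embeddings toward the teacher's with a
# cosine-similarity loss. No labels or MTEB task supervision is involved.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

STUDENT = "BAAI/bge-base-en"   # stand-in student backbone (768-dim)
TEACHER = "intfloat/e5-base"   # stand-in frozen teacher (also 768-dim)

tok_s = AutoTokenizer.from_pretrained(STUDENT)
tok_t = AutoTokenizer.from_pretrained(TEACHER)
student = AutoModel.from_pretrained(STUDENT)
teacher = AutoModel.from_pretrained(TEACHER).eval()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def mean_pool(hidden, mask):
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

def distill_step(texts):
    with torch.no_grad():  # teacher stays frozen
        bt = tok_t(texts, padding=True, truncation=True, return_tensors="pt")
        t_emb = mean_pool(teacher(**bt).last_hidden_state, bt["attention_mask"])
    bs = tok_s(texts, padding=True, truncation=True, return_tensors="pt")
    s_emb = mean_pool(student(**bs).last_hidden_state, bs["attention_mask"])
    # Cosine distillation loss: 0 when the student reproduces the teacher's direction.
    loss = (1 - F.cosine_similarity(s_emb, t_emb)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. distill_step(["an unlabeled sentence from wiki", "another one from finewebedu"])
```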

Thanks for the MTEB maintainers' efforts. MTEB 2.0 is cool, and I hope it can be a perfect leaderboard >~<

Thanks for the quick reply and the kind words. We will add the annotations then :))

Can you confirm that you used the same teacher models for stella_en_400M_v5, stella-base-en-v2 and stella-mrl-large-zh-v3.5-1792d as well, or was your procedure different for these models?

NovaSearch org
edited 2 days ago

@kardosdrur Hi, here is a summary:
jasper_en_vision_language_v1: based on stella_en_1.5B_v5, distilled from stella_en_1.5B_v5 and nvidia/NV-Embed-v2, using unsupervised texts (e.g. wiki, finewebedu)
stella_en_400M_v5: based on Alibaba-NLP/gte-large-en-v1.5, distilled from gte-Qwen2-7B-instruct and nvidia/NV-Embed-v1, using unsupervised texts
stella_en_1.5B_v5: based on Alibaba-NLP/gte-Qwen2-1.5B-instruct, distilled from gte-Qwen2-7B-instruct and nvidia/NV-Embed-v1, using unsupervised texts
stella-base-en-v2: based on BAAI/bge-base-en, distilled from BAAI/bge-base-en and intfloat/e5-base, using unsupervised texts
stella-mrl-large-zh-v3.5-1792d: sorry, I forgot the details

Thanks for the summary! Do you have further info on the data composition for the Chinese Stella models?
I can see that you have uploaded the datasets to HF Hub, but do these by any chance overlap with C-MTEB?
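
For reference, the kind of check I have in mind is roughly the sketch below. The training-corpus name is a placeholder, and the C-MTEB config/split/field names can differ per task, so treat this only as an outline of an exact-match overlap check:

```python
# Rough sketch of an exact-match overlap check between a training corpus and one
# C-MTEB task corpus. "your-org/your-training-corpus" is a placeholder; the
# config/split/field names of C-MTEB datasets vary by task -- check the dataset card.
from datasets import load_dataset

train = load_dataset("your-org/your-training-corpus", split="train")  # placeholder
cmteb = load_dataset("C-MTEB/T2Retrieval", "corpus", split="corpus")  # one C-MTEB corpus

train_texts = {t.strip() for t in train["text"]}
task_texts = {t.strip() for t in cmteb["text"]}

overlap = train_texts & task_texts
print(f"{len(overlap)}/{len(task_texts)} task texts "
      f"({len(overlap) / max(len(task_texts), 1):.2%}) appear verbatim in the training corpus")
```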
