Messages about new models and report
Hi, everyone, thanks for using stella models.
After six months of work, I trained the jasper model on top of the stella model. It is a multimodal model, and it should rank 2nd on MTEB (I submitted the results on 2024-12-11; they may still need official review: https://github.com/embeddings-benchmark/results/pull/68).
Model link: https://huggingface.co./infgrad/jasper_en_vision_language_v1
I'll focus on the technical report, training data, and related code. Hopefully the tricks I've used will be of some help to you!
This work was done in my free time as a personal hobby. One person's time and energy are limited, so any contributions are welcome!
Could you explain in your paper how you obtained the dunzhang/stella_en_400M_v5 model? Is it pure distillation, or did you retrain using a contrastive loss like InfoNCE? I would like to reproduce this model while scaling down the number of layers to make a smaller model.
Thanks
Hi @claeyzre, thank you for your interest.
dunzhang/stella_en_400M_v5: it was first distilled on about 100M unsupervised samples, then retrained using contrastive loss.
I think distillation may be a good pretraining method, and then you can fine-tune it on your specific data.
There are too many distillation tricks to write them all up in this report, and I've even forgotten some of them! 😂 Anyway, if you have any questions about your reproduction, please do not hesitate to contact me; I will do my best to help you.
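For readers who want a concrete picture of what "distill first" can mean for an embedding model, here is a minimal sketch. It is not the stella or jasper training code: the two loss functions, their names, and the random stand-in vectors are assumptions chosen for illustration. The idea is that the student is pushed toward the teacher's vectors on unlabeled text, either directly (cosine loss, which needs equal dimensions) or through in-batch similarities (which does not).

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine_distill_loss(student, teacher):
    # Mean (1 - cosine similarity) between matching student/teacher rows.
    # Requires both models to output vectors of the same dimension.
    s, t = l2_normalize(student), l2_normalize(teacher)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

def similarity_distill_loss(student, teacher):
    # Mean squared error between the in-batch similarity matrices.
    # Works even when student and teacher dimensions differ.
    s, t = l2_normalize(student), l2_normalize(teacher)
    return float(np.mean((s @ s.T - t @ t.T) ** 2))

rng = np.random.default_rng(0)
teacher_vecs = rng.normal(size=(4, 8))  # stand-in for teacher outputs on 4 unlabeled texts
student_vecs = rng.normal(size=(4, 8))  # stand-in for student outputs on the same texts
print(cosine_distill_loss(student_vecs, teacher_vecs))
print(similarity_distill_loss(student_vecs, teacher_vecs))
```

In a real setup the teacher vectors would be computed once, without gradients, and the student would be trained to minimize a weighted sum of such terms.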
Hello,
I am trying to fine-tune my own pre-trained model with 300M parameters (https://huggingface.co./keeeeenw/MicroLlama) for text embedding.
Based on your replies above,
Is https://www.sbert.net/examples/training/distillation/README.html#knowledge-distillation a good starting point for distillation? What would be a good teacher model?
Is there quick starter code for contrastive loss? It looks like you are also using instructions for the embedding model, so is https://www.sbert.net/examples/training/prompts/README.html a good starting point?
If you don't have the time to answer these questions, I am looking forward to learning more about these details in your technical report / training code.
Thanks!
Hi, @keeeeenw
https://www.sbert.net/examples/training/distillation/README.html#knowledge-distillation is a good starting point for distillation. Their method is different from mine, but it can give you a better understanding of distillation.
If your teacher model is A and B, then A or B will be the best choice. If A or B is too large for you, you can use a smaller vector model trained with the same data as A or B. If there is no such model, just select a vector model.
As for starter code for contrastive loss, https://github.com/NLPJCL/RAG-Retrieval may be an option.
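As a rough illustration of the contrastive loss being asked about, the following is a minimal InfoNCE sketch with in-batch negatives. It is not taken from RAG-Retrieval or from the stella training code; the function name, the temperature value of 0.05, and the toy inputs are assumptions.

```python
import numpy as np

def info_nce_loss(queries, passages, temperature=0.05):
    # Row i of `passages` is the positive for row i of `queries`;
    # every other row in the batch serves as a negative.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature
    # Subtract the row-wise max before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the matching pair) as the target class.
    return float(-np.mean(np.diag(log_probs)))

aligned = np.eye(3)                      # each query matches its own passage exactly
shuffled = np.roll(aligned, 1, axis=0)   # every passage paired with the wrong query
print(info_nce_loss(aligned, aligned))   # close to zero
print(info_nce_loss(aligned, shuffled))  # large
```

With a real model, `queries` and `passages` would be the encoder outputs for a batch of paired texts, and the same computation would be written with a framework that supports gradients.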
Understood! Thanks for the detailed explanations and thanks for sharing the link. I will study https://github.com/NLPJCL/RAG-Retrieval/tree/master/rag_retrieval/train/embedding more carefully.
Hi @infgrad ,
dunzhang/stella_en_400M_v5: it was first distilled on about 100M unsupervised samples, then retrained using contrastive loss.
I think distillation may be a good pretraining method, and then you can fine-tune it on your specific data.
-> This is interesting because usually it's the other way around: first unsupervised contrastive training, then distillation.
There are too many distillation tricks to write them all up in this report, and I've even forgotten some of them! 😂 Anyway, if you have any questions about your reproduction, please do not hesitate to contact me; I will do my best to help you.
Too bad :(. Do you have the distillation code somewhere? I don't mind if it's dirty; I'd be glad to help you make it cleaner in your repository.
Thank you very much!
@infgrad
my email is ***
I am particularly interested in Stella. Could you send me the code for both Stella and Jasper?
Thanks !
@infgrad, is it possible to send the training code for Stella to me as well? I am also interested in learning more about your distillation process and your data processing pipeline. Your code for https://github.com/NLPJCL/RAG-Retrieval/tree/master/rag_retrieval/train/embedding is very easy to follow, but I cannot do much with the T2 sample training data or T2 data in general for MTEB English rankings because it is in Chinese. I assume you are using a set of different datasets for distillation as well as contrastive loss. My email is (Email removed. Thanks for sharing your code!)
Next week, I will try to find some useful scripts and upload them to https://huggingface.co./infgrad/jasper_en_vision_language_v1.
@infgrad thanks for sharing your code. This is very cool! So essentially you are using filtered web passages for 80% of your training data and you are using the teacher model to compute your positive/negative labels for the rank loss (https://huggingface.co./infgrad/jasper_en_vision_language_v1/blob/main/scripts/original_stella_jasper_training_codes/run_train_distill_stage1.py#L353).
This method is very generalizable because you don't need any annotated dataset with pre-computed positive/negative labels. Before looking at your code, I was under the impression that you would need to pre-compute these positive/negative labels separately using another model, but it looks like you can compute the loss in a single pass with the two other losses. I wonder if NV-EMBED V2 uses a method similar to the one described in "4.1.1 HARD-NEGATIVE MINING TECHNIQUE" of their paper (https://arxiv.org/pdf/2405.17428).
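To make the idea of teacher-derived labels concrete, here is a minimal sketch of a rank loss in which the teacher's in-batch similarities decide which pairs count as positive or negative. It is not a copy of the linked script: the function name, both margin values, and the toy vectors are assumptions for illustration only.

```python
import numpy as np

def teacher_rank_loss(student, teacher, label_margin=0.1, hinge_margin=0.05):
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    s_sim, t_sim = s @ s.T, t @ t.T
    # gap[i, j, k] = sim(i, j) - sim(i, k) for anchor i and candidates j, k.
    t_gap = t_sim[:, :, None] - t_sim[:, None, :]
    s_gap = s_sim[:, :, None] - s_sim[:, None, :]
    # The teacher labels j as positive and k as negative for anchor i
    # whenever it scores j clearly higher than k.
    n = len(t)
    eye = np.eye(n, dtype=bool)
    valid = ~eye[:, :, None] & ~eye[:, None, :]  # skip triplets that reuse the anchor
    mask = (t_gap > label_margin) & valid
    if not mask.any():
        return 0.0
    # Hinge penalty when the student fails to keep the same order by a margin.
    hinge = np.maximum(0.0, hinge_margin - s_gap)
    return float(hinge[mask].mean())

teacher_vecs = np.array([[1.0, 0.0], [0.9, 0.436], [0.0, 1.0]])
student_vecs = teacher_vecs[[0, 2, 1]]  # student that ranks the neighbors in the wrong order
print(teacher_rank_loss(teacher_vecs, teacher_vecs))  # student agrees with teacher
print(teacher_rank_loss(student_vecs, teacher_vecs))  # student disagrees
```

Because the labels come from the teacher's similarity matrix for the current batch, nothing has to be mined or annotated ahead of time, which is what makes the approach usable on plain unlabeled text.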
@infgrad thank you so much!!!
For reproduction purposes, do you have the YAML files used for the conf objects for the 3 stages?