Finetuning BGE large increases the similarity score for unrelated queries
Hello, I have finetuned the model using the training set format given in their GitHub repo (https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune). For 5 sentences (s1, s2, s3, s4, s5) I created a training set with 5 queries, each with 1 positive and 5 negatives.
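For reference, each line of training_set.jsonl follows the format described in that repo, i.e. one query with its positive and its negatives (the sentences here are placeholders, not my real data):

{"query": "a sentence similar to s1", "pos": ["s1"], "neg": ["s2", "s3", "s4", "s5", "some unrelated sentence"]}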
Now, after finetuning, when I calculate the cosine similarity score between a sentence similar to s2 and sentences that are similar to s3, the score should fundamentally go down. But the score is increasing instead.
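This is roughly how I compute that score (a minimal sketch using FlagModel from the same repo; the two sentence strings are placeholders):

import numpy as np
from FlagEmbedding import FlagModel

# Load the finetuned checkpoint (the output_dir used in the command below)
model = FlagModel("/home/ankur/projects/nextiva_projects/vector_model_finetuning/bge_large_v1")

# Placeholders standing in for "a sentence similar to s2" and "a sentence similar to s3"
sent_a = "a sentence close in meaning to s2"
sent_b = "a sentence close in meaning to s3"

emb_a, emb_b = model.encode([sent_a, sent_b])
# Normalize explicitly so the dot product is the cosine similarity
cos = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
print(cos)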
For finetuning, we are using the arguments below:
torchrun \
-m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir /home/ankur/projects/nextiva_projects/vector_model_finetuning/bge_large_v1 \
--model_name_or_path BAAI/bge-large-en-v1.5 \
--train_data /home/ankur/projects/nextiva_projects/vector_model_finetuning/training_set.jsonl \
--learning_rate 1e-5 \
--fp16 \
--save_strategy=epoch \
--save_total_limit 1 \
--num_train_epochs 5 \
--per_device_train_batch_size 3 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 512 \
--train_group_size 6 \
--negatives_cross_device \
--logging_steps 100 \
--report_to 'tensorboard' \
--query_instruction_for_retrieval ""
So I want to know if I am doing something wrong or whether I need to take a different approach.
Sorry for the late reply, we just finished our holiday. Actually, we optimize the model with a contrastive loss, which increases the gap between the scores of positive pairs and negative pairs. This does not guarantee that the similarity between other sample pairs will necessarily decrease. Evaluation of a retrieval model is generally based on the relative rather than absolute scores of samples (i.e., positive samples should rank higher than negative samples).
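To illustrate the point (a minimal sketch of an in-batch contrastive objective, not our exact training code): with a temperature, the loss only pushes each query's positive score above its negative scores, so the absolute cosine similarity between two sentences that never appear together as a (query, positive/negative) pair can still go up or down.

import torch
import torch.nn.functional as F

def info_nce_loss(q, pos, negs, temperature=0.02):
    # q: (d,) query embedding, pos: (d,) positive, negs: (k, d) negatives,
    # all assumed L2-normalized so dot products are cosine similarities
    pos_score = torch.dot(q, pos) / temperature
    neg_scores = negs @ q / temperature
    logits = torch.cat([pos_score.unsqueeze(0), neg_scores]).unsqueeze(0)  # (1, 1+k)
    target = torch.zeros(1, dtype=torch.long)  # the positive sits at index 0
    # Cross-entropy only cares that the positive outscores the negatives,
    # not about the absolute similarity of any other sentence pair
    return F.cross_entropy(logits, target)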
Besides, it would be helpful for analyzing the cause if you could provide some data samples.