Results from the SentenceTransformer library sometimes differ from those produced by the FlagModel
Long sentences or paragraphs seem to be broken apart by the FlagModel implementation and then embedded individually, so the output contains two or more vectors for one input. This does not happen with SentenceTransformer, where one input yields one output containing a single embedding.
We don't set the tokenizer to split long sentences, and I cannot reproduce this error. Could you provide a script?
Of course, here is a script to reproduce it:
text = "In a bustling town where shadows whispered secrets, a cat named Mira gained the ability to speak human language. One evening, Mira whispered a forgotten legend about a hidden treasure beneath the town's oldest tree to a young, curious adventurer. Together, they embarked on a moonlit quest, forging an unbreakable bond while unearthing mysteries of the past."
from FlagEmbedding import FlagModel
query_instruction_for_retrieval='Represent this sentence for searching relevant passages: '
model = FlagModel('BAAI/bge-large-en', query_instruction_for_retrieval=query_instruction_for_retrieval)
model.encode(text)
The output is:
array([[ 0.01307663, 0.01230509, -0.02265706, ..., 0.01391939,
-0.03492293, -0.00590275],
[ 0.00093336, 0.02802942, -0.03854717, ..., -0.0290005 ,
-0.00743153, 0.00360901]], dtype=float32)
If I run:
from sentence_transformers import SentenceTransformer

model2 = SentenceTransformer('BAAI/bge-large-en')
p_embeddings = model2.encode(text, normalize_embeddings=True)
The output is:
array([ 0.01453966, 0.01963736, -0.02570449, ..., 0.01102038,
-0.03386544, -0.00838666], dtype=float32)
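For completeness, here is one way to quantify the mismatch (a rough sketch, assuming both snippets above ran in the same session so model, text, and p_embeddings are still defined):

import numpy as np

flag_out = model.encode(text)   # two rows here, one per fragment
st_out = p_embeddings           # a single vector from SentenceTransformer
# cosine similarity of each FlagModel row against the SentenceTransformer vector
sims = flag_out @ st_out / (np.linalg.norm(flag_out, axis=1) * np.linalg.norm(st_out))
print(sims)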
Hope that is helpful!
Thanks!
FlagEmbedding didn't support passing a single string as input, which caused this error.
We have updated the FlagEmbedding repo; you can install the new version with:
pip install -U FlagEmbedding
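As a quick sanity check after updating (a minimal sketch; with the fix, a single string input should yield a single embedding):

from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-en')
emb = model.encode("A single input sentence.")
print(emb.shape)  # expect one embedding vector, e.g. (1024,) for bge-large-en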
Perfect! Thank you
Can I ask another question: what is the best way to compare two sentences with each other? I have requirements and skills of people, and I want to find out whether a person has the required skill. However, I have realized that the best matches often depend more on the similarity of the sentence structure (length, grammar, etc.) than on the actual content. What instruction would you recommend when using your model, and do you have any general tips?
If you need to search for the answer to a short query, add the provided instruction to the query; in other cases, no instruction is needed, just use the original query directly. The bge models focus on general ability. Since your scenario is different from the classical retrieval or similarity task, it's better to fine-tune the model on your own data; you can use this tool to fine-tune it.
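For reference, the fine-tuning data for that tool is a jsonl file where each line pairs a query with positive and negative texts, roughly of the form below (a sketch with made-up contents):

{"query": "Experience building REST APIs in Python", "pos": ["Developed several Flask and FastAPI services"], "neg": ["Enjoys hiking and photography on weekends"]}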
Thank you so much! That is incredibly helpful – and I see the GitHub is now also online :)
Besides, you can select some negatives that have the same sentence structure as your sentences, so that the model depends more on the actual content.
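For example, a negative that copies the query's structure but not its meaning (hypothetical texts):

{"query": "Five years of experience with Java backend development", "pos": ["Has built Java server applications since 2018"], "neg": ["Five years of experience with yoga and wellness coaching"]}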
Yeah, I agree – I can probably create very diverse sentences in terms of structure/length and make only the meaning stand out. I'll let you know if it works :D Thanks again – really cool project!
Would I use some kind of prompt for the fine-tuning, e.g. 'Create this sentence so that it can be compared in meaning with other sentences', similar to how you use "Represent this sentence for searching relevant passages: " for retrieval?
You can try both using a prompt and not using one; in fact, I'm not sure which is better for your task.
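A small sketch of such a comparison, reusing model2 from above (the instruction wording here is hypothetical):

inst = 'Represent this sentence for comparing its meaning with other sentences: '
a = 'Skilled in Python programming'
b = 'Knows how to write Python code'
plain = model2.encode([a, b], normalize_embeddings=True)
prompted = model2.encode([inst + a, inst + b], normalize_embeddings=True)
# with normalized embeddings, the dot product is the cosine similarity
print(plain[0] @ plain[1], prompted[0] @ prompted[1])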
Ok, will try. Thanks again