Finalize Sentence Transformers integration

#3
by tomaarsen - opened

Hello!

Thanks @Samoed for setting up this work.

Pull Request overview

  • Add modules.json to inform Sentence Transformers what modules to use
    • It points to the new MultiModalTransformer module, as well as the standard Pooling and Normalize modules
  • Configure the Pooling layer to use last-token pooling (see the sketch after this list)
  • Add the Sentence Transformers usage to the README
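
For reference, here is a minimal sketch of how the merged setup is meant to load; the module stack in the comments is my reading of the files in this PR:

```python
# Minimal sketch, assuming this PR's modules.json and custom_st.py are merged into the repo.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/mmE5-mllama-11b-instruct", trust_remote_code=True)

# modules.json should resolve to a stack roughly like:
#   (0) custom_st.MultiModalTransformer  - processes text and/or images
#   (1) Pooling                          - configured for last-token pooling
#   (2) Normalize                        - L2-normalizes the embeddings
print(model)
```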

Details

I've updated the usage compared to @Samoed's suggestion: instead of always specifying both a type and a content, users can now specify a dictionary with text, image, or both. Still, as @intfloat mentioned, you cannot embed image-text pairs and pure text in the same batch, but I don't think that's a big problem. As before, just using "A cat and a dog" instead of {"text": "A cat and a dog"} also works, but in the README I want to point users to {"text": "A cat and a dog"} so they (hopefully) understand how it can be expanded to also include an image in that embedding.
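
To illustrate, a sketch of that usage (the image path and instruction text here are placeholders, not the exact README examples):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/mmE5-mllama-11b-instruct", trust_remote_code=True)

# Text only: a plain string and a {"text": ...} dict are equivalent
text_embeddings = model.encode([
    "A cat and a dog",
    {"text": "A cat and a dog"},
])

# Image + text, combined into a single embedding
pair_embedding = model.encode(
    {"image": "path/to/image.jpg", "text": "Represent the given image."}
)
```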

I want to share that, annoyingly enough, I was unable to get 100% correspondence between Transformers and Sentence Transformers. The problem is somewhere in the processing or the model forward, not in the pooling or normalization, and the difference is quite small. For example, the mean and std of the model's last hidden states are identical to at least 4 significant digits. The cosine similarities of the Image + Text -> Text README examples are [0.3965, 0.3105] for Transformers and [0.3967, 0.3090] for Sentence Transformers. The Text -> Image examples have [0.4219, 0.3887] for Transformers and [0.4250, 0.3896] for Sentence Transformers.
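
For context, the cross-check behind those numbers could be sketched like this (emb_hf and emb_st are assumed to hold the embeddings produced by the two code paths):

```python
import torch.nn.functional as F

# emb_hf: embeddings from the plain Transformers path
# emb_st: embeddings from the Sentence Transformers path
def compare(emb_hf, emb_st):
    # Summary statistics of the raw embeddings
    print("mean/std (HF):", emb_hf.mean().item(), emb_hf.std().item())
    print("mean/std (ST):", emb_st.mean().item(), emb_st.std().item())
    # Row-wise cosine similarity between the two implementations
    print("cosine:", F.cosine_similarity(emb_hf, emb_st, dim=-1))
```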

  • Tom Aarsen
tomaarsen changed pull request title from Finalize Sentence Transforemrs integration to Finalize Sentence Transformers integration
tomaarsen changed pull request status to open

Great! Can you also update the custom_st file? It seems that <|begin_of_text|> needs to be passed after an image: https://github.com/haon-chen/mmE5/issues/1#issuecomment-2678420039

I add it automatically if there is both an image and text: https://huggingface.co./intfloat/mmE5-mllama-11b-instruct/blob/refs%2Fpr%2F3/custom_st.py#L71-L72

Should <|begin_of_text|> always be included, even if there is only an image and no text?
And should it be included if there is no image, but only text?

Also, I strip spaces from the left of the text if there is no image, and I make sure there is exactly one space between <|begin_of_text|> and the text if there is an image. Is that correct?
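
In other words, the current behavior is roughly the following (a simplified sketch; the actual code is in custom_st.py at the lines linked above, and <|image|> handling is omitted here):

```python
def build_prompt(text: str, has_image: bool) -> str:
    if has_image:
        # Insert <|begin_of_text|> with exactly one space before the text
        return "<|begin_of_text|> " + text.lstrip()
    # Text only: strip leading spaces, no <|begin_of_text|>
    return text.lstrip()
```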

  • Tom Aarsen

Hello @tomaarsen !

Thanks for the great work!

Should <|begin_of_text|> always be included, even if there is only an image and no text?
Well, this case should not happen. When only an image is available, a text instruction should be added along with the image, such as <|image|><|begin_of_text|> Represent the given image.

I know it feels unnecessary, but currently the model is trained in this way.
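
So an image-only input would be phrased like this (a sketch; the instruction wording is only an example):

```python
# Image-only retrieval: attach a generic instruction to the image,
# which should render as "<|image|><|begin_of_text|> Represent the given image."
inputs = {"image": "path/to/image.jpg", "text": "Represent the given image."}
```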

And should it be included if there is no image, but only text?
No, as the example of encoding A cat and a dog shows, there is no need to include <|begin_of_text|> when there is only text.

Also, I remove spaces from the left of the text if there is no image, and I make sure there is 1 space of text between <|begin_of_text|> and the text if there is an image. Is that correct?
Yes, that is correct. In fact, there is no need to add the extra space anyway; the previous demo was buggy.

intfloat changed pull request status to merged
