I think you're on the right track with this @giffmana. But it suggests the transformers implementation is doing this incorrectly, right?
As it stands, I'm getting really strange and objectively bad results when comparing SigLIP 1 to SigLIP 2, even when using the official Google space for it (https://huggingface.co./spaces/google/zero-shot-sg1-sg2/blob/main/app.py), which appears to stop after the sigmoid. SigLIP 2 is consistently less confident in its predictions (when computing the confidences this way) and is highly sensitive to, for instance, removing the '.' from the end of the label template, e.g.
texts = [f"This is a photo of {label}" for label in candidate_labels]
vs.
texts = [f"This is a photo of {label}." for label in candidate_labels]
which does not seem right at all.
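For context, this is roughly the kind of side-by-side check I mean (a minimal sketch; the image path is just an example, and both checkpoint names are the public so400m ones):

from transformers import pipeline

candidate_labels = ["man", "cat", "horse", "dog"]

# Run the same image and labels through both generations of the model via the zero-shot pipeline
for ckpt in ("google/siglip-so400m-patch14-384", "google/siglip2-so400m-patch14-384"):
    pipe = pipeline("zero-shot-image-classification", model=ckpt)
    print(ckpt, pipe("dog.jpg", candidate_labels=candidate_labels))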
Just wanted to add that I found the github readme on transformers that shows how to perform pre- and post-processing yourself using AutoModel and AutoProcessor. I noted that this example performs torch.sigmoid on the raw model output, which leaves the values looking similar to how they look when running it via the pipeline.
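Concretely, my script follows that example roughly like this (a sketch; the image path and labels are my own test inputs, and I may not have matched every preprocessing default):

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("dog.jpg")  # my test image
candidate_labels = ["man", "cat", "horse", "dog"]
texts = [f"This is a photo of a {label}." for label in candidate_labels]

# pre-processing: tokenize the prompts (padded to the 64-token length SigLIP expects) and prepare the image
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits_per_image   # raw image-text scores, shape (1, num_labels)
probs = torch.sigmoid(logits)       # per-label probabilities; independent values, not a distribution over labels
print(logits)
print(probs)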
Following the github example almost exactly with my dog image, I see the following for the labels man, cat, horse, dog (with template "This is a photo of a xxxx."):
Raw logits: tensor([[-16.1657, -14.3962, -15.7023, -7.3122]])
Post sigmoid: tensor([[9.5352e-08, 5.5954e-07, 1.5156e-07, 6.6692e-04]])
0.0% that image 0 is 'man'
0.0% that image 0 is 'cat'
0.0% that image 0 is 'horse'
0.1% that image 0 is 'dog'
Could you install transformers from main?
pip install git+https://github.com/huggingface/transformers@main
Let us know if this solves the issue. 🤗
Hi @ariG23498 and @giffmana, this seems like it was probably the issue with the padding, and I apologize for that - I had to be on dev for another model to work recently. I will say, something still seems off. The scores (logits?) come back as tiny e-08 values; to make them look at all like probabilities I've had to scale them and apply softmax. It's also getting very easy questions wrong:
Pipeline output for a clear image of a dog:
[[{'score': 8.585556088291924e-08, 'label': 'plant'}, {'score': 6.025020837796546e-08, 'label': 'dog'}, {'score': 4.270815523454985e-08, 'label': 'woman'}, {'score': 2.5436479589302508e-08, 'label': 'cat'}, {'score': 1.708304253611459e-08, 'label': 'man'}]]
This is using google/siglip2-so400m-patch14-384.
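For what it's worth, by "scale them and apply softmax" I mean something like the following (the scale factor is arbitrary, which is part of why it feels wrong):

import torch

# pipeline scores from the dog image above, in order: plant, dog, woman, cat, man
scores = torch.tensor([8.59e-08, 6.03e-08, 4.27e-08, 2.54e-08, 1.71e-08])

# without scaling, softmax over values this small is essentially uniform (~0.2 each),
# so I multiply by an arbitrary factor first to spread them out
probs = torch.softmax(scores * 1e8, dim=-1)
print(probs)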
Even when it's correct, the confidence seems very low, which makes me think softmax is not how you are supposed to transform the results.
Pipeline output:
[[{'score': 6.025032917023054e-08, 'label': 'dog'}, {'score': 5.832493954471829e-08, 'label': 'horse'}, {'score': 2.5436429851311004e-08, 'label': 'cat'}]]
Thanks for any tips, sorry I'm kinda new at this!
All good, I appreciate the effort!
The warning gives you the answer: pass max_length=64
I've been passing a kwarg to pipe for max_length the whole time, and it doesn't propagate to the tokenizer, and the warning and error persist. Furthermore, if this is required, why is it not shown in the example? Why does the example work without specifying anything at all?
The only way I've managed to get this to work is by modifying zero_shot_image_classification.py myself, on the lines indicated above, by adding:
if "max_length" not in tokenizer_kwargs:
    tokenizer_kwargs["max_length"] = 64
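A quick sanity check of what that edit effectively does (the labels here are just examples chosen to tokenize to different lengths):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-so400m-patch14-384")
texts = ["This is a photo of a dog.", "This is a photo of a golden retriever."]

# with an explicit max_length, padding="max_length" pads both sequences to the same length,
# so the pipeline can batch them into a single tensor
enc = tok(texts, padding="max_length", max_length=64, return_tensors="pt")
print(enc["input_ids"].shape)  # both rows padded to 64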
I tried running the zero-shot classification example and got: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (input_ids in this case) have excessive nesting (inputs type list where type int is expected).
transformers version=4.49.0.dev0
I tried adding both padding=True and truncation=True to no avail. I also tried padding="max_length".
EDIT: it seems to work if my labels are all the same length. Doing some debugging, I see that in zero_shot_image_classification.py, the padding provided to the tokenizer is forced to be "max_length" anyway, here (L148-149):
padding = "max_length" if self.model.config.model_type == "siglip" else True
text_inputs = self.tokenizer(sequences, return_tensors=self.framework, padding=padding, **tokenizer_kwargs)
And yet, if my labels have variable lengths, the tokenizer outputs are not the same length, and so calling torch.tensor on them ultimately fails.
I did spot this warning in my terminal as well: Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
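In case it helps with debugging, here is a minimal sketch of what I'm seeing on that dev version (same checkpoint; the labels just need to tokenize to different lengths):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-so400m-patch14-384")
texts = ["This is a photo of a dog.", "This is a photo of a golden retriever."]

# on my install, padding="max_length" with no max_length (and no predefined model max length)
# triggers the warning above and falls back to no padding, so the sequences come back with different lengths
enc = tok(texts, padding="max_length")
print([len(ids) for ids in enc["input_ids"]])  # unequal lengths, so return_tensors="pt" would fail here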