I think you're on the right track with this @giffmana. But it suggests the transformers implementation is doing this incorrectly, right?
As it stands, I'm getting really strange and objectively bad results when comparing SigLIP 1 to SigLIP 2, even when using the official Google space for it (https://huggingface.co./spaces/google/zero-shot-sg1-sg2/blob/main/app.py), which appears to stop after the sigmoid. SigLIP 2 is consistently less confident in its predictions (when computing the confidences this way) and is highly sensitive to, for instance, removing the '.' from the end of the label template, e.g.
texts = [f"This is a photo of {label}" for label in candidate_labels]
vs.
texts = [f"This is a photo of {label}." for label in candidate_labels]
which does not seem right at all.
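For context, this is roughly the kind of side-by-side check I mean (a minimal sketch; the image path is just an example, and both checkpoint names are the public so400m ones):

from transformers import pipeline

candidate_labels = ["man", "cat", "horse", "dog"]

# Run the same image and labels through both generations of the model via the zero-shot pipeline
for ckpt in ("google/siglip-so400m-patch14-384", "google/siglip2-so400m-patch14-384"):
    pipe = pipeline("zero-shot-image-classification", model=ckpt)
    print(ckpt, pipe("dog.jpg", candidate_labels=candidate_labels))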
Just wanted to add that I found the github readme on transformers that shows how to perform pre- and post-processing yourself using AutoModel and AutoProcessor. I noted that this example performs torch.sigmoid on the raw model output, which leaves the values looking similar to how they look when running it via the pipeline.
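Concretely, my script follows that example roughly like this (a sketch; the image path and labels are my own test inputs, and I may not have matched every preprocessing default):

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("dog.jpg")  # my test image
candidate_labels = ["man", "cat", "horse", "dog"]
texts = [f"This is a photo of a {label}." for label in candidate_labels]

# pre-processing: tokenize the prompts (padded to the 64-token length SigLIP expects) and prepare the image
inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits_per_image   # raw image-text scores, shape (1, num_labels)
probs = torch.sigmoid(logits)       # per-label probabilities; independent values, not a distribution over labels
print(logits)
print(probs)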
Following the github example almost exactly with my dog image, I see the following for the labels man, cat, horse, dog (with template "This is a photo of a xxxx."):
Raw logits: tensor([[-16.1657, -14.3962, -15.7023, -7.3122]])
Post sigmoid: tensor([[9.5352e-08, 5.5954e-07, 1.5156e-07, 6.6692e-04]])
0.0% that image 0 is 'man'
0.0% that image 0 is 'cat'
0.0% that image 0 is 'horse'
0.1% that image 0 is 'dog'
Could you install transformers from main?
pip install git+https://github.com/huggingface/transformers@main
Let us know if this solves the issue. 🤗
Hi @ariG23498 and @giffmana, this seems like it was probably the issue with the padding, and I apologize for that - I had to be on dev for another model to work recently. I will say, something still seems off. The scores (logits?) come back as tiny e-08 values; to make them look at all like probabilities I've had to scale them and apply softmax. It's also getting very easy questions wrong:
Pipeline output for a clear image of a dog:
[[{'score': 8.585556088291924e-08, 'label': 'plant'}, {'score': 6.025020837796546e-08, 'label': 'dog'}, {'score': 4.270815523454985e-08, 'label': 'woman'}, {'score': 2.5436479589302508e-08, 'label': 'cat'}, {'score': 1.708304253611459e-08, 'label': 'man'}]]
This is using google/siglip2-so400m-patch14-384.
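For what it's worth, by "scale them and apply softmax" I mean something like the following (the scale factor is arbitrary, which is part of why it feels wrong):

import torch

# pipeline scores from the dog image above, in order: plant, dog, woman, cat, man
scores = torch.tensor([8.59e-08, 6.03e-08, 4.27e-08, 2.54e-08, 1.71e-08])

# without scaling, softmax over values this small is essentially uniform (~0.2 each),
# so I multiply by an arbitrary factor first to spread them out
probs = torch.softmax(scores * 1e8, dim=-1)
print(probs)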
Even when it's correct, the confidence seems very low, which makes me think softmax is not how you are supposed to transform the results.
Pipeline output:
[[{'score': 6.025032917023054e-08, 'label': 'dog'}, {'score': 5.832493954471829e-08, 'label': 'horse'}, {'score': 2.5436429851311004e-08, 'label': 'cat'}]]
Thanks for any tips, sorry I'm kinda new at this!
All good, I appreciate the effort!
The warning gives you the answer: pass max_length=64
I've been passing a kwarg to pipe for max_length the whole time, and it doesn't propagate to the tokenizer, and the warning and error persist. Furthermore, if this is required, why is it not shown in the example? Why does the example work without specifying anything at all?
The only way I've managed to get this to work is by modifying zero_shot_image_classification.py myself, on the lines indicated above, by adding:
if "max_length" not in tokenizer_kwargs:
    tokenizer_kwargs["max_length"] = 64
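A quick sanity check of what that edit effectively does (the labels here are just examples chosen to tokenize to different lengths):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-so400m-patch14-384")
texts = ["This is a photo of a dog.", "This is a photo of a golden retriever."]

# with an explicit max_length, padding="max_length" pads both sequences to the same length,
# so the pipeline can batch them into a single tensor
enc = tok(texts, padding="max_length", max_length=64, return_tensors="pt")
print(enc["input_ids"].shape)  # both rows padded to 64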
I tried running the zero-shot classification example and got: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (input_ids in this case) have excessive nesting (inputs type list where type int is expected).
transformers version=4.49.0.dev0
I tried adding both padding=True and truncation=True to no avail. I also tried padding="max_length".
EDIT: it seems to work if my labels are all the same length. Doing some debugging, I see that in zero_shot_image_classification.py, the padding provided to the tokenizer is forced to be "max_length" anyway, here (L148-149):
padding = "max_length" if self.model.config.model_type == "siglip" else True
text_inputs = self.tokenizer(sequences, return_tensors=self.framework, padding=padding, **tokenizer_kwargs)
And yet, if my labels have variable lengths, the tokenizer outputs are not the same length, and so calling torch.tensor on them ultimately fails.
I did spot this warning in my terminal as well: Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
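In case it helps with debugging, here is a minimal sketch of what I'm seeing on that dev version (same checkpoint; the labels just need to tokenize to different lengths):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/siglip2-so400m-patch14-384")
texts = ["This is a photo of a dog.", "This is a photo of a golden retriever."]

# on my install, padding="max_length" with no max_length (and no predefined model max length)
# triggers the warning above and falls back to no padding, so the sequences come back with different lengths
enc = tok(texts, padding="max_length")
print([len(ids) for ids in enc["input_ids"]])  # unequal lengths, so return_tensors="pt" would fail here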