Weird output
I just tried the demo available at https://huggingface.co./spaces/google/paligemma using one of my old works:
and I got pretty weird output:
The response from the Moondream2 model seems to be better, despite its lower scores on the benchmarks:
Is it a limitation of the PaliGemma model, or something wrong with the demo configuration?
hello 👋 @MoonRide I think the confusion comes from a few things. Firstly, the mix checkpoints and Moondream are not trained in the same fashion. This model is not going to give long answers, but rather shorter yet grounded ones, and if it cannot give a grounded answer, it will fall back to responses like "unanswerable" or the one above.
The mix checkpoint is fine-tuned on a mix of benchmark datasets, e.g. COCO captions, where the prompt is "caption", so when I pass that I get the answer below.
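For reference, a minimal sketch of passing that "caption" prefix through transformers (the checkpoint id, image URL, and generation settings here are illustrative, not the demo's exact configuration):

```python
# Minimal sketch: prompting a PaliGemma mix checkpoint with the "caption" task prefix.
# The checkpoint id, image URL, and generation settings are illustrative only.
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/your_image.png", stream=True).raw)
prompt = "caption"  # task prefix the mix checkpoint expects for COCO-style captions

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, skipping the echoed prompt
answer = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```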
Moreover, if you want longer answers you can fine-tune the "pt" models on the dataset Moondream was trained on. The main point of PaliGemma is to provide good, fine-tunable base models. Although the mix models have good zero-shot capabilities (image captioning, segmentation, document tasks and more), they're not meant to be chatty.
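A rough sketch of what such a fine-tuning setup could look like (the dataset name, column names, and hyperparameters below are placeholders; this is only an outline, not the actual recipe Moondream used):

```python
# Rough sketch: fine-tuning a "pt" checkpoint on long-caption data.
# Dataset name, column names, and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from transformers import (AutoProcessor, PaliGemmaForConditionalGeneration,
                          Trainer, TrainingArguments)

model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

train_ds = load_dataset("your-org/long-caption-dataset", split="train")  # placeholder dataset

def collate(examples):
    prompts = ["caption en" for _ in examples]                 # long-caption task prefix
    targets = [ex["long_caption"] for ex in examples]          # the long answers you want
    images = [ex["image"].convert("RGB") for ex in examples]
    batch = processor(text=prompts, images=images, suffix=targets,
                      return_tensors="pt", padding="longest")
    return batch.to(torch.bfloat16)  # only floating-point tensors (pixel_values) are cast

args = TrainingArguments(
    output_dir="paligemma-long-captions",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    remove_unused_columns=False,  # keep raw columns so the collator can see them
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collate)
trainer.train()
```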
It wasn't really a chatty prompt or answer (some big models tend to write really long answers, THOSE are chatty) - I just asked for a brief description. I hear what you're saying, but in my opinion a good base model should cover a somewhat wider range of both prompts and images. Clear instructions given in plain English should be covered, as well as the ability to recognize content in digital art - those are pretty typical use cases. I wasn't trying to confuse the model; that was literally my 1st test image. Please consider this in future revisions.