Can PaliGemma answer multiple questions about a single image?
Hello everyone,
I am interested in knowing if it is possible to ask multiple questions about a single image using this model.
Specifically, I am looking to:
Input an image into the model.
Ask several different questions related to the content of the image.
Receive accurate and contextually relevant answers for each question.
Has anyone tried this before? If so, could you please share your experience and any sample code or guidelines on how to achieve this? Any tips on optimizing the performance for such tasks would also be highly appreciated.
Thank you in advance for your help!
Yes, this works well — you pair the image with each question and generate the answers either in one batch or one prompt at a time. Here's how you can achieve this effectively (you can find example code for both approaches in this gist):
- Leverage batch processing:

prompts = [
    'What is the year, make, and model of the car in the image?\n',
    'What color is the house in the background, and how many doors you see?\n',
    'What color is the doors?\n',
]
images = [image] * len(prompts)  # Pair the same image with every question

model_inputs = processor(
    text=prompts, images=images, return_tensors="pt", padding=True, truncation=True
).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]  # Padded prompt length

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)

# generate() echoes the prompt tokens, so keep only the newly generated part of each row
generated_texts = generation[:, input_len:]
decoded_texts = [processor.decode(text, skip_special_tokens=True) for text in generated_texts]

for prompt, decoded_text in zip(prompts, decoded_texts):
    print(f"Question: {prompt}")
    print(f"Answer: {decoded_text}\n")
Output:
Question: What is the year, make, and model of the car in the image?
Answer: The year is 1965, the make is a Volkswagen Beetle, and the model is a 1965 Volkswagen Beetle.
Question: What color is the house in the background, and how many doors you see?
Answer: yellow 2
Question: What color is the doors?
Answer: brown
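A quick self-contained illustration of the slicing step above, since it often trips people up: generate() returns the prompt tokens followed by the newly generated tokens, so cutting each row at input_len leaves only the answer. The token IDs below are made up purely for the demonstration.

```python
# Toy token IDs (not real tokenizer output) to show why the slice works.
prompt_ids = [101, 2054, 3609, 102]    # pretend these are the tokenized question
answer_ids = [200, 201, 202]           # pretend these are the generated answer tokens
full_output = prompt_ids + answer_ids  # shape of what generate() returns per row
input_len = len(prompt_ids)            # same role as input_ids.shape[-1] above
generated = full_output[input_len:]    # strip the echoed prompt
print(generated)  # → [200, 201, 202]
```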
- Iterate over the prompts one at a time:

prompts = [
    'What is the year, make, and model of the car in the image?\n',
    'What color is the house in the background, and how many doors you see?\n',
    'What color is the doors?\n',
]
images = [image] * len(prompts)  # Pair the same image with every question

for prompt, image_data in zip(prompts, images):
    model_inputs = processor(
        text=[prompt], images=[image_data], return_tensors="pt", padding=True
    ).to(model.device)
    input_len = model_inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]  # Strip the echoed prompt tokens
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(f"Question: {prompt}")
    print(f"Answer: {decoded}\n")
Output:
Question: What is the year, make, and model of the car in the image?
Answer: The year is 1965, the make is a Volkswagen Beetle, and the model is a 1965 Volkswagen Beetle.
Question: What color is the house in the background, and how many doors you see?
Answer: yellow 2
Question: What color is the doors?
Answer: brown
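On the performance question: batching (the first approach) usually gives the biggest speedup, since the model runs once for all questions instead of once per question. Beyond that, half-precision weights and eval mode help with memory and latency. Both snippets above assume model, processor, and image already exist; a minimal loading sketch, where the google/paligemma-3b-mix-224 checkpoint and the "car.jpg" path are assumptions on my part, not something from this thread:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint; use whichever you have
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # roughly halves memory vs. float32
    device_map="auto",           # places weights on GPU when one is available
).eval()
processor = AutoProcessor.from_pretrained(model_id)
image = Image.open("car.jpg")   # hypothetical path to your image
```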
Hello @selamw ,
Thank you so much for your detailed and insightful response to my question!