Can PaliGemma answer multiple questions about a single image?
Hello everyone,
I am interested in knowing if it is possible to ask multiple questions about a single image using this model.
Specifically, I am looking to:
Input an image into the model.
Ask several different questions related to the content of the image.
Receive accurate and contextually relevant answers for each question.
Has anyone tried this before? If so, could you please share your experience and any sample code or guidelines on how to achieve this? Any tips on optimizing the performance for such tasks would also be highly appreciated.
Thank you in advance for your help!
Yes, this works well — you pair the image with each question and generate the answers either in one batch or one prompt at a time. Here's how you can achieve this effectively (you can find example code for both approaches in this gist):
- Leverage batch processing:

prompts = [
    'What is the year, make, and model of the car in the image?\n',
    'What color is the house in the background, and how many doors you see?\n',
    'What color is the doors?\n',
]
images = [image] * len(prompts)  # Pair the same image with every question

model_inputs = processor(
    text=prompts, images=images, return_tensors="pt", padding=True, truncation=True
).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]  # Padded prompt length

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)

# generate() echoes the prompt tokens, so keep only the newly generated part of each row
generated_texts = generation[:, input_len:]
decoded_texts = [processor.decode(text, skip_special_tokens=True) for text in generated_texts]

for prompt, decoded_text in zip(prompts, decoded_texts):
    print(f"Question: {prompt}")
    print(f"Answer: {decoded_text}\n")
Output:
Question: What is the year, make, and model of the car in the image?
Answer: The year is 1965, the make is a Volkswagen Beetle, and the model is a 1965 Volkswagen Beetle.
Question: What color is the house in the background, and how many doors you see?
Answer: yellow 2
Question: What color is the doors?
Answer: brown
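A quick self-contained illustration of the slicing step above, since it often trips people up: generate() returns the prompt tokens followed by the newly generated tokens, so cutting each row at input_len leaves only the answer. The token IDs below are made up purely for the demonstration.

```python
# Toy token IDs (not real tokenizer output) to show why the slice works.
prompt_ids = [101, 2054, 3609, 102]    # pretend these are the tokenized question
answer_ids = [200, 201, 202]           # pretend these are the generated answer tokens
full_output = prompt_ids + answer_ids  # shape of what generate() returns per row
input_len = len(prompt_ids)            # same role as input_ids.shape[-1] above
generated = full_output[input_len:]    # strip the echoed prompt
print(generated)  # → [200, 201, 202]
```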
- Iterate over the prompts one at a time:

prompts = [
    'What is the year, make, and model of the car in the image?\n',
    'What color is the house in the background, and how many doors you see?\n',
    'What color is the doors?\n',
]
images = [image] * len(prompts)  # Pair the same image with every question

for prompt, image_data in zip(prompts, images):
    model_inputs = processor(
        text=[prompt], images=[image_data], return_tensors="pt", padding=True
    ).to(model.device)
    input_len = model_inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]  # Strip the echoed prompt tokens
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(f"Question: {prompt}")
    print(f"Answer: {decoded}\n")
Output:
Question: What is the year, make, and model of the car in the image?
Answer: The year is 1965, the make is a Volkswagen Beetle, and the model is a 1965 Volkswagen Beetle.
Question: What color is the house in the background, and how many doors you see?
Answer: yellow 2
Question: What color is the doors?
Answer: brown
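On the performance question: batching (the first approach) usually gives the biggest speedup, since the model runs once for all questions instead of once per question. Beyond that, half-precision weights and eval mode help with memory and latency. Both snippets above assume model, processor, and image already exist; a minimal loading sketch, where the google/paligemma-3b-mix-224 checkpoint and the "car.jpg" path are assumptions on my part, not something from this thread:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint; use whichever you have
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # roughly halves memory vs. float32
    device_map="auto",           # places weights on GPU when one is available
).eval()
processor = AutoProcessor.from_pretrained(model_id)
image = Image.open("car.jpg")   # hypothetical path to your image
```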
Hello @selamw ,
Thank you so much for your detailed and insightful response to my question!