Trying IDEFICS on a New Yorker cartoon dataset
Earlier this year, I generated descriptions of cartoons using Salesforce's LAVIS library and its InstructBLIP multimodal instruct model. Their concept was to combine a visual encoder (BLIP-2), Vicuna (LLaMa v1 plus delta weights), and additional InstructBLIP weights. This was a bit tedious because of the limited release of LLaMa v1 and the need to re-assemble the model in memory.
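For comparison, the LAVIS route looked roughly like this (a sketch from memory: it assumes the Vicuna-7B weights have already been re-assembled from LLaMa v1 plus the delta and wired into the LAVIS config, and "cartoon.png" is a placeholder file):

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# InstructBLIP on top of Vicuna-7B; the merged Vicuna checkpoint has to be
# assembled separately and pointed to in the LAVIS model config.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct",
    model_type="vicuna7b",
    is_eval=True,
    device=device,
)

raw_image = Image.open("cartoon.png").convert("RGB")  # placeholder image path
image_tensor = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image_tensor, "prompt": "Describe this cartoon in detail."}))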
IDEFICS is a complete multimodal instruct model from HuggingFace, and I'm excited to try it out on the same task.
First I got an A100 Colab Pro notebook and loaded the model as described in the IDEFICS blog post:
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = IdeficsForVisionText2Text.from_pretrained(
    "HuggingFaceM4/idefics-9b-instruct",
    torch_dtype=torch.bfloat16,
).to(device)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-9b-instruct")
Here's how to grab the New Yorker cartoons and caption-matching dataset:
from datasets import load_dataset
cartoons = load_dataset("jmhessel/newyorker_caption_contest", "matching")
I can grab a single PIL image from the dataset like this:
image = None
for cartoon in cartoons['train']:
    image = cartoon['image']
    # print(cartoon['caption_choices'])
    break

image  # at the end of a Colab cell, to see the image yourself
In the previous project I had time to think about prompts. Most image-captioning examples use some variation of "Describe the image in detail", but because these images already have the distinct style of a New Yorker cartoon, that prompt tends to return the same surface-level facts in every response. Since I want enough detail to serve as hints for picking the matching joke or caption, I state up front that it's a cartoon, ask for specifics, and hint that the premise may be unusual.
Here's the prompt in instruct / chat format:
prompts = [
    [
        "User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.",
        image,
        "<end_of_utterance>",
    ],
]
Let's follow through with the rest of the generation code from the HF blog:
inputs = processor(prompts, return_tensors="pt").to(device)
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
generated_ids = model.generate(**inputs, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)

for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")
The result:
User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.
Assistant: The image features two giraffes standing in a living room. One giraffe is sitting on a couch, while the other is standing near a coffee table. The living room is furnished with a TV mounted on the wall, a lamp on a side table, and a potted plant on the floor
Passes the vibe check!
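To spot-check a few more cartoons, the same steps can be wrapped in a small helper (a sketch; describe_cartoon is a name I'm introducing here, not something from the IDEFICS blog post):

def describe_cartoon(pil_image, max_length=100):
    # Build the same single-turn prompt as above for an arbitrary cartoon image.
    prompt = [
        [
            "User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.",
            pil_image,
            "<end_of_utterance>",
        ],
    ]
    inputs = processor(prompt, return_tensors="pt").to(device)
    generated_ids = model.generate(**inputs, bad_words_ids=bad_words_ids, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Print descriptions for the first three cartoons in the training split.
for cartoon in cartoons['train'].select(range(3)):
    print(describe_cartoon(cartoon['image']))
    print("---")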
Colab link: https://colab.research.google.com/drive/15kd17YRdbVayggA-ZCYiXTYzZG4w8zUd?usp=sharing
Future work
Admittedly, the descriptions of the first few cartoons were not as accurate, so I did a little cherry-picking, but they were all in the right zone. I think there's an opportunity to reword the prompt or to provide a few-shot example in this chat-like format, as sketched below.
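Here's roughly what a few-shot version of the prompt might look like, following the multi-turn format from the IDEFICS blog post (a sketch: the example description is a placeholder to be written by hand, and example_image just grabs a second cartoon from the training split):

# Hypothetical few-shot prompt: one solved example turn, then the target cartoon.
example_image = cartoons['train'][1]['image']

few_shot_prompts = [
    [
        "User: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.",
        example_image,
        "<end_of_utterance>",
        # Placeholder: replace with a hand-written description of example_image.
        "\nAssistant: <hand-written description of the example cartoon goes here>",
        "<end_of_utterance>",
        "\nUser: Describe all characters and setting of this cartoon in detail. It may be sardonic or absurdist.",
        image,
        "<end_of_utterance>",
        "\nAssistant:",
    ],
]

inputs = processor(few_shot_prompts, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, bad_words_ids=bad_words_ids, max_length=250)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])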