YesBut: A High-Quality Annotated Multimodal Dataset for Evaluating Satire Comprehension Capability of Vision-Language Models
Abstract
Understanding satire and humor is a challenging task even for current Vision-Language Models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from two given options such that the complete image is satirical), and release a high-quality dataset, YesBut, consisting of 2,547 images (1,084 satirical and 1,463 non-satirical) spanning different artistic styles, to evaluate these tasks. Each satirical image in the dataset depicts a normal scenario along with a conflicting scenario that is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut dataset in zero-shot settings with respect to both automated and human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at https://github.com/abhi1nandy2/yesbut_dataset.
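As a rough illustration of the zero-shot setting, the Satirical Image Detection task can be posed to an off-the-shelf VLM as a single yes/no question about an image. The sketch below is only an assumption about how such a query could look (the model choice, prompt wording, and image path are illustrative, not the paper's exact evaluation protocol):

```python
# Illustrative zero-shot prompt for Satirical Image Detection.
# NOTE: model, prompt, and file path are assumptions, not the paper's setup.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical local image from the YesBut dataset
with open("yesbut_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any capable vision-language model could be substituted
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Is this image satirical? Answer with 'yes' or 'no' only."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # expected output: "yes" or "no"
```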
Community
🎉 Exciting News! 🎉
Our paper "YesBut: A High-Quality Annotated Multimodal Dataset for Evaluating Satire Comprehension Capability of Vision-Language Models" has been accepted as a long paper at #EMNLP2024! 🚀
Authors: Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, Ankit Raj, Pawan Goyal, Niloy Ganguly
🔑 Key Highlights:
What’s special?
- 🗂️ YesBut Dataset: A one-of-a-kind multimodal dataset with 2,547 images, combining satirical and non-satirical content, enriched with diverse artistic styles!
- 🖼️ Challenges: We introduced unique tasks like Satirical Image Detection, Satirical Image Understanding, and Satirical Image Completion—each pushing the boundaries of current Vision-Language (VL) models.
- 🤖 Benchmarking Results: Even cutting-edge VL models struggle with our tasks, showing the complexity of understanding irony, humor, and societal satire!
Future Directions:
- Developing models that truly grasp complex human emotions like irony and humor.
- Exploring cross-lingual satire comprehension and real-world applications in digital media.
Stay tuned for more! 🙌
Check out this high-level, fun video explanation of the paper: https://www.youtube.com/watch?v=S7zJis8rEbw
NOW THIS IS RESEARCH
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images (2024)
- VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks (2024)
- Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images (2024)
- Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types (2024)
- Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering (2024)
In Table 3, the Kosmos-2 models show relatively large differences between test accuracy and F1, even though the class prior is not that skewed. Additionally, their values are almost switched after CoT is applied. How can I interpret this? Apologies in advance if I missed something in the paper.
Kosmos-2 in the Zero-Shot CoT setting performs well on only one of the two classes (satirical vs. non-satirical); that is why the accuracy is high while the F1 score is low.
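To make the accuracy/F1 gap concrete, here is a small, deliberately exaggerated illustration. The class ratio is taken from the full dataset and macro-averaged F1 is assumed; both are simplifications rather than the paper's exact test split or metric:

```python
# Exaggerated illustration: a model that collapses to one class can still get
# moderate accuracy while its (macro-averaged) F1 drops sharply.
from sklearn.metrics import accuracy_score, f1_score

# Assume a split mirroring the full dataset: 1463 non-satirical (0), 1084 satirical (1).
y_true = [0] * 1463 + [1] * 1084
# Degenerate predictor that always answers "non-satirical".
y_pred = [0] * (1463 + 1084)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.574
print(f"Macro F1: {f1_score(y_true, y_pred, average='macro', zero_division=0):.3f}")  # ~0.365
```

If CoT prompting instead pushes the model toward favouring the satirical class, the two metrics can roughly swap places, which would be consistent with the "almost switched" pattern noted in the question.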