arxiv:2406.19389

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Published on Jun 27, 2024

Submitted by LXT on Jun 28, 2024

#1 Paper of the day

Authors: Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu

Abstract

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.
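To make the idea of fusing perception priors with image features more concrete, here is a minimal pure-Python sketch. It is not the released implementation, and the abstract does not specify the mechanism; the sketch assumes one plausible form, in which each pixel feature is augmented with a mix of object-query embeddings weighted by a softmax over that pixel's mask scores. The function names (`perception_prior_embedding`, `build_visual_tokens`), the tensor layout, and the simple token concatenation are all hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def perception_prior_embedding(pixel_feats, mask_logits, query_embeds):
    """Hypothetical fusion of segmentation priors into image features.

    pixel_feats:  P feature vectors of dimension D (one per pixel/patch)
    mask_logits:  P lists of Q scores (how strongly pixel p belongs to object q)
    query_embeds: Q object-query embeddings of dimension D
    Returns P vectors: pixel feature + score-weighted mix of query embeddings.
    """
    fused = []
    for feat, logits in zip(pixel_feats, mask_logits):
        weights = softmax(logits)  # assumed normalization over object queries
        prior = [
            sum(w * q[d] for w, q in zip(weights, query_embeds))
            for d in range(len(feat))
        ]
        fused.append([f + p for f, p in zip(feat, prior)])
    return fused


def build_visual_tokens(pixel_tokens, object_tokens, prompt_tokens):
    """Assumed layout: one flat visual-token sequence handed to the LLM."""
    return pixel_tokens + object_tokens + prompt_tokens


if __name__ == "__main__":
    pixel_feats = [[0.0, 0.0], [1.0, 1.0]]
    mask_logits = [[0.0, 0.0], [100.0, -100.0]]
    query_embeds = [[2.0, 0.0], [0.0, 2.0]]
    fused = perception_prior_embedding(pixel_feats, mask_logits, query_embeds)
    tokens = build_visual_tokens(fused, query_embeds, [[0.5, 0.5]])
    print(len(tokens), "visual tokens")
```

In this toy setup, a pixel with equal scores for both objects receives the average of the two query embeddings, while a pixel confidently assigned to one object receives essentially that object's embedding. The actual model is trained end to end and operates on learned high-dimensional features, so the sketch shows only the general shape of the data flow.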

Community

LXT

Paper author Paper submitter Jun 28, 2024

OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect each specialist, this work contains only one encoder, one decoder, and one LLM.

LXT

Paper author Paper submitter Jun 28, 2024

Project page: https://lxtgh.github.io/project/omg_llava/

Code: https://github.com/lxtGH/OMG-Seg

nielsr

Jul 1, 2024

Hi @LXT congrats on this work!

Would you be able to link the Space to this paper page? https://huggingface.co./spaces/LXT/OMG_Seg. See here on how to do that: https://huggingface.co./docs/hub/en/model-cards#linking-a-paper.

Also, I see 2 checkpoints are released (which can also be linked to this paper page), but they are currently in a single repository: https://huggingface.co./LXT/OMG_Seg. This way, model downloads don't work. Would you be interested in following https://huggingface.co./docs/hub/models-uploading#upload-a-pytorch-model-using-huggingfacehub to upload models as separate repos to the hub?

LXT

Paper author Jul 9, 2024

Hi! @nielsr Thanks for your suggestion.

For the first question, we will make a new Space, since this work is the next step after OMG-Seg.

For the second question, we will update the model format ASAP!

saharmor

Jul 5, 2024

Kudos @LXT and team. I've featured this paper in my AI research newsletter https://www.aitidbits.ai/p/july-4th-2024#:~:text=and%20composition%20tasks-,Researchers,-propose%20OMG%2DLLaVA

Looking forward to more novel papers and methods.

LXT

Paper author Jul 9, 2024

Hi saharmor, thanks for your attention to our work! We have released the code and model: https://github.com/lxtGH/OMG-Seg/tree/main/omg_llava

L-Hongbin

Nov 13, 2024

@librarian-bot recommend

librarian-bot

Nov 13, 2024

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

L-Hongbin

Nov 13, 2024

@librarian-bot recommend

davanstrien

Nov 13, 2024

Hey @L-Hongbin, at the moment @librarian-bot only provides one recommendation per paper so we don't clutter the comments, but you can directly request similar papers in this Space: https://huggingface.co./spaces/librarian-bots/recommend_similar_papers :)