arXiv:2303.05499

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Published on Mar 9, 2023
Abstract

In this paper, we present an open-set object detector, called Grounding DINO, created by marrying the Transformer-based detector DINO with grounded pre-training; it can detect arbitrary objects given human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector for open-set concept generalization. To effectively fuse the language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also evaluate on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
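The language-guided query selection mentioned in the abstract can be illustrated with a short sketch: score each image token by its best similarity to any text token, and take the top-scoring tokens to initialize the decoder queries. This is a minimal NumPy illustration of the idea only, not the paper's implementation; the function name, array shapes, and dot-product similarity are assumptions.

```python
import numpy as np

def language_guided_query_selection(img_feats, txt_feats, num_queries):
    """Select the image tokens whose features best match any text token.

    img_feats: (num_img_tokens, d) cross-modality-enhanced image features
    txt_feats: (num_txt_tokens, d) cross-modality-enhanced text features
    Returns the indices of the selected image tokens, which would be used
    to initialize the decoder queries (sketch only; not the paper's code).
    """
    # Similarity logits between every image token and every text token.
    logits = img_feats @ txt_feats.T          # (num_img, num_txt)
    # Score each image token by its best-matching text token.
    scores = logits.max(axis=1)               # (num_img,)
    # Keep the top-k image tokens, highest score first.
    return np.argsort(scores)[::-1][:num_queries]

# Toy example with random features (hypothetical dimensions).
rng = np.random.default_rng(0)
img = rng.standard_normal((100, 16))   # 100 image tokens, 16-d features
txt = rng.standard_normal((4, 16))     # 4 text tokens, 16-d features
idx = language_guided_query_selection(img, txt, num_queries=10)
print(idx.shape)  # (10,)
```

The selected indices would then seed the positional part of the decoder queries, which the cross-modality decoder refines against both image and text features.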

