Model Card for DAB-DETR (dab-detr-resnet-50-dc5-fixxy)

Table of Contents

  1. Model Details
  2. Model Sources
  3. How to Get Started with the Model
  4. Training Details
  5. Evaluation
  6. Model Architecture and Objective
  7. Citation

Model Details

We present in this paper a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer by layer. Using box coordinates not only lets us exploit explicit positional priors to improve query-to-feature similarity and eliminate the slow training convergence issue in DETR, but also allows us to modulate the positional attention map using box width and height information. This design makes it clear that queries in DETR can be implemented as performing soft ROI pooling layer by layer in a cascade manner. As a result, it achieves the best performance on the MS-COCO benchmark among DETR-like detection models under the same setting, e.g., 45.7% AP with a ResNet-50-DC5 backbone trained for 50 epochs. We also conducted extensive experiments to confirm our analysis and verify the effectiveness of our method.
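
To make the formulation concrete, below is a minimal, self-contained sketch of the idea of treating anchor boxes (x, y, w, h) as positional queries that are refined layer by layer. This is not the released implementation; all module and function names (ToyAnchorQueryDecoder, pos_proj, bbox_head) are illustrative assumptions, and the real model uses sinusoidal box encodings and cross-attention over encoder features.

import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    # numerically stable inverse of sigmoid, used to update boxes in logit space
    x = x.clamp(min=eps, max=1 - eps)
    return torch.log(x / (1 - x))

class ToyAnchorQueryDecoder(nn.Module):
    # Illustrative only: each "layer" turns the current anchor box into a positional
    # query, mixes it with the content query, and predicts a box refinement.
    def __init__(self, hidden_size=256, num_layers=6):
        super().__init__()
        self.pos_proj = nn.Linear(4, hidden_size)   # stand-in for the sinusoidal box encoding
        self.layers = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(num_layers))
        self.bbox_head = nn.Linear(hidden_size, 4)  # predicts (dx, dy, dw, dh) offsets

    def forward(self, content_query, anchor_boxes):
        # content_query: (num_queries, hidden_size); anchor_boxes: (num_queries, 4) in [0, 1]
        for layer in self.layers:
            pos_query = self.pos_proj(anchor_boxes)                  # anchor box -> positional query
            content_query = torch.relu(layer(content_query + pos_query))
            offsets = self.bbox_head(content_query)
            anchor_boxes = torch.sigmoid(inverse_sigmoid(anchor_boxes) + offsets)  # layer-by-layer update
        return content_query, anchor_boxes

queries, boxes = ToyAnchorQueryDecoder()(torch.randn(300, 256), torch.rand(300, 4))
print(boxes.shape)  # torch.Size([300, 4])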

Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

  • Developed by: Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang
  • Funded by: IDEA-Research
  • Shared by: David Hajdu
  • Model type: DAB-DETR
  • License: Apache-2.0

Model Sources

  • Repository: https://github.com/IDEA-Research/DAB-DETR
  • Paper: https://arxiv.org/abs/2201.12329

How to Get Started with the Model

Use the code below to get started with the model.

import torch
import requests

from PIL import Image
from transformers import AutoModelForObjectDetection, AutoImageProcessor

# load an example image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5-fixxy")
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5-fixxy")

# preprocess the image and run inference
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# convert raw outputs to (x_min, y_min, x_max, y_max) boxes in pixel coordinates,
# keeping only detections above the score threshold
results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")

This should output:

remote: 0.85 [41.41, 72.6, 177.42, 118.84]
cat: 0.84 [343.45, 21.74, 641.99, 368.87]
cat: 0.82 [13.25, 54.13, 318.95, 470.27]
remote: 0.70 [333.44, 76.56, 369.1, 189.68]
couch: 0.55 [-0.95, 0.03, 639.02, 476.81]

Training Details

Training Data

The DAB-DETR model was trained on COCO 2017 object detection, a dataset consisting of 118k/5k annotated images for training/validation respectively.
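
For reference, here is a minimal sketch of loading COCO 2017 detection data with torchvision; the local paths are placeholders for a dataset downloaded separately from cocodataset.org, and pycocotools must be installed.

import torchvision

# paths below are placeholders for a local COCO 2017 download
train_dataset = torchvision.datasets.CocoDetection(
    root="coco/train2017",
    annFile="coco/annotations/instances_train2017.json",
)
val_dataset = torchvision.datasets.CocoDetection(
    root="coco/val2017",
    annFile="coco/annotations/instances_val2017.json",
)
print(len(train_dataset), len(val_dataset))  # roughly 118k / 5k images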

Training Procedure

Following Deformable DETR and Conditional DETR, we use 300 anchors as queries, and we likewise select the 300 predicted boxes and labels with the largest classification logits for evaluation. We also use focal loss (Lin et al., 2020) with α = 0.25, γ = 2 for classification. The same loss terms are used in bipartite matching and in the final loss calculation, but with different coefficients: the classification loss has coefficient 2.0 in bipartite matching but 1.0 in the final loss, while the L1 loss with coefficient 5.0 and the GIoU loss (Rezatofighi et al., 2019) with coefficient 2.0 are the same in both the matching and the final loss calculation. All models are trained on 16 GPUs with 1 image per GPU, using AdamW (Loshchilov & Hutter, 2018) with weight decay 1e-4. The learning rates for the backbone and the other modules are set to 1e-5 and 1e-4, respectively. We train our models for 50 epochs and multiply the learning rate by 0.1 after 40 epochs. All models are trained on Nvidia A100 GPUs. We search hyperparameters with batch size 64, and all results in our paper are reported with batch size 16.
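
A hedged sketch of the optimizer setup described above (separate learning rates for the backbone and the rest of the model, weight decay 1e-4, and a 10x learning-rate drop after 40 of 50 epochs). This mirrors the text rather than the authors' released training script, and matching parameters on the substring "backbone" is an assumption about how the 🤗 implementation names its modules.

import torch
from transformers import AutoModelForObjectDetection

model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5-fixxy")

# lower learning rate for the backbone, higher for everything else
backbone_params = [p for n, p in model.named_parameters() if "backbone" in n and p.requires_grad]
other_params = [p for n, p in model.named_parameters() if "backbone" not in n and p.requires_grad]

optimizer = torch.optim.AdamW(
    [
        {"params": backbone_params, "lr": 1e-5},
        {"params": other_params, "lr": 1e-4},
    ],
    weight_decay=1e-4,
)
# multiply the learning rate by 0.1 after 40 epochs (50 epochs total)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)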

Preprocessing

Images are resized/rescaled such that the shortest side is at least 480 and at most 800 pixels and the longest side is at most 1333 pixels, then normalized across the RGB channels with the ImageNet mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225).
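
Equivalently, a hedged sketch of configuring the 🤗 image processor with these values; the parameter names follow the DETR-family image processors, and the checkpoint already stores these settings, so overriding them is only needed if you want something different.

from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(
    "IDEA-Research/dab-detr-resnet-50-dc5-fixxy",
    size={"shortest_edge": 800, "longest_edge": 1333},  # resize rule described above
    image_mean=[0.485, 0.456, 0.406],                   # ImageNet mean
    image_std=[0.229, 0.224, 0.225],                    # ImageNet standard deviation
)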

Training Hyperparameters

  • Training regime:
Key Value
activation_dropout 0.0
activation_function prelu
attention_dropout 0.0
auxiliary_loss false
backbone resnet50
bbox_cost 5
bbox_loss_coefficient 5
class_cost 2
cls_loss_coefficient 2
decoder_attention_heads 8
decoder_ffn_dim 2048
decoder_layers 6
dropout 0.1
encoder_attention_heads 8
encoder_ffn_dim 2048
encoder_layers 6
focal_alpha 0.25
giou_cost 2
giou_loss_coefficient 2
hidden_size 256
init_std 0.02
init_xavier_std 1.0
initializer_bias_prior_prob null
keep_query_pos false
normalize_before false
num_hidden_layers 6
num_patterns 0
num_queries 300
query_dim 4
random_refpoints_xy false
sine_position_embedding_scale null
temperature_height 20
temperature_width 20
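
These values can also be read directly from the checkpoint configuration; a small sketch (the attribute names follow the table above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5-fixxy")
print(config.num_queries)          # 300
print(config.activation_function)  # "prelu"
print(config.hidden_size)          # 256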

Evaluation

As reported in the paper, DAB-DETR reaches 45.7% AP on the MS-COCO benchmark with a ResNet-50-DC5 backbone trained for 50 epochs, the best result among DETR-like detectors under the same setting; see the paper for the full comparison tables.

Model Architecture and Objective

Overview of DAB-DETR. We extract image spatial features using a CNN backbone followed by Transformer encoders that refine the CNN features. Dual queries, comprising positional queries (anchor boxes) and content queries (decoder embeddings), are then fed into the decoder to probe objects that correspond to the anchors and have patterns similar to the content queries. The dual queries are updated layer by layer to gradually move closer to the target ground-truth objects. The outputs of the final decoder layer are used by the prediction heads to predict objects with labels and boxes, and a bipartite graph matching is then conducted to calculate the loss, as in DETR.
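
To illustrate the final matching step, here is a minimal sketch of DETR-style bipartite matching with the Hungarian algorithm. The cost here is deliberately simplified to an L1 box distance; the real matcher combines classification, L1 box, and GIoU costs with the coefficients listed under Training Procedure.

import torch
from scipy.optimize import linear_sum_assignment

def match_predictions_to_targets(pred_boxes, target_boxes):
    # pred_boxes: (num_queries, 4), target_boxes: (num_targets, 4), both normalized (cx, cy, w, h)
    # cost here is only the L1 box distance; the real cost also includes class and GIoU terms
    cost = torch.cdist(pred_boxes, target_boxes, p=1)
    pred_idx, target_idx = linear_sum_assignment(cost.numpy())
    return pred_idx, target_idx

pred_idx, target_idx = match_predictions_to_targets(torch.rand(300, 4), torch.rand(5, 4))
print(list(zip(pred_idx, target_idx)))  # each ground-truth box matched to exactly one query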

Citation

BibTeX:

@inproceedings{liu2022dabdetr,
  title={{DAB}-{DETR}: Dynamic Anchor Boxes are Better Queries for {DETR}},
  author={Shilong Liu and Feng Li and Hao Zhang and Xiao Yang and Xianbiao Qi and Hang Su and Jun Zhu and Lei Zhang},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=oMI9PjOb9Jl}
}

Model Card Authors

David Hajdu
