---
library_name: transformers
tags:
- vision
- object-detection
license: apache-2.0
language:
- en
pipeline_tag: object-detection
---

# Model Card for Model ID

## Table of Contents

1. [Model Details](#model-details)
2. [Model Sources](#model-sources)
3. [How to Get Started with the Model](#how-to-get-started-with-the-model)
4. [Training Details](#training-details)
5. [Evaluation](#evaluation)
6. [Model Architecture and Objective](#model-architecture-and-objective)
7. [Citation](#citation)

## Model Details

![image/png](https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dab_detr_convergence_plot.png)

> We present in this paper a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer. Using box coordinates not only helps using explicit positional priors to improve the query-to-feature similarity and eliminate the slow training convergence issue in DETR, but also allows us to modulate the positional attention map using the box width and height information. Such a design makes it clear that queries in DETR can be implemented as performing soft ROI pooling layer-by-layer in a cascade manner. As a result, it leads to the best performance on MS-COCO benchmark among the DETR-like detection models under the same setting, e.g., AP 45.7\% using ResNet50-DC5 as backbone trained in 50 epochs. We also conducted extensive experiments to confirm our analysis and verify the effectiveness of our methods.

### Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- **Developed by:** Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang
- **Funded by:** IDEA-Research
- **Shared by:** David Hajdu
- **Model type:** DAB-DETR
- **License:** Apache-2.0

### Model Sources

- **Repository:** https://github.com/IDEA-Research/DAB-DETR
- **Paper:** https://arxiv.org/abs/2201.12329

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import requests

from PIL import Image
from transformers import AutoModelForObjectDetection, AutoImageProcessor

# Download a sample image from the COCO 2017 validation set.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the pretrained detector from the Hub.
image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50")
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50")

inputs = image_processor(images=image, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes to the original (height, width) and keep detections scoring above 0.3.
results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```

This should output:

```
cat: 0.87 [14.7, 49.39, 320.52, 469.28]
remote: 0.86 [41.08, 72.37, 173.39, 117.2]
cat: 0.86 [344.45, 19.43, 639.85, 367.86]
remote: 0.61 [334.27, 75.93, 367.92, 188.81]
couch: 0.59 [-0.04, 1.34, 639.9, 477.09]
```

## Training Details

### Training Data

The DAB-DETR model was trained on [COCO 2017 object detection](https://cocodataset.org/#download), a dataset of 118k annotated training images and 5k annotated validation images.

### Training Procedure

Following Deformable DETR and Conditional DETR, we use 300 anchors as queries.
We select 300 predicted boxes and labels with the largest classification logits for evaluation as well. We also use focal loss (Lin et al., 2020) with α = 0.25, γ = 2 for classification. The same loss terms are used in bipartite matching and in the final loss calculation, but with different coefficients. Classification loss with coefficient 2.0 is used in bipartite matching but 1.0 in the final loss. L1 loss with coefficient 5.0 and GIOU loss (Rezatofighi et al., 2019) with coefficient 2.0 are consistent in both the matching and the final loss calculation procedures. All models are trained on 16 GPUs with 1 image per GPU, and AdamW (Loshchilov & Hutter, 2018) is used for training with weight decay 10⁻⁴. The learning rates for the backbone and the other modules are set to 10⁻⁵ and 10⁻⁴ respectively. We train our models for 50 epochs and drop the learning rate by 0.1 after 40 epochs. All models are trained on NVIDIA A100 GPUs. We search hyperparameters with batch size 64, and all results in our paper are reported with batch size 16.

#### Preprocessing

Images are resized so that the shortest side is at least 480 and at most 800 pixels and the longest side is at most 1333 pixels. They are then normalized across the RGB channels with the ImageNet mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225).
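The resizing and normalization rules above can be illustrated with a minimal pure-Python sketch. This is not the `transformers` image processor's actual implementation: the function names `target_size` and `normalize_pixel` are hypothetical, and the fixed `shortest_edge=800` default is an assumption that stands in for a size drawn from the 480 to 800 range.

```python
# ImageNet statistics quoted in the Preprocessing paragraph.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def target_size(height, width, shortest_edge=800, longest_edge=1333):
    """Return (new_height, new_width), preserving the aspect ratio.

    Hypothetical helper: scale the short side to `shortest_edge` unless
    that would push the long side past `longest_edge`.
    """
    short_side, long_side = min(height, width), max(height, width)
    scale = shortest_edge / short_side
    if long_side * scale > longest_edge:
        # The long-side cap wins; the short side ends up below `shortest_edge`.
        scale = longest_edge / long_side
    return round(height * scale), round(width * scale)


def normalize_pixel(rgb):
    """Normalize one RGB pixel given as 0-255 integers (hypothetical helper)."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


print(target_size(480, 640))   # (800, 1067)
print(target_size(500, 2000))  # (333, 1333): the long-side cap applies
```

For very wide or very tall images the long-side cap takes precedence, so the short side can fall below the nominal minimum, as the second call shows.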
### Training Hyperparameters

- **Training regime:**

| **Key** | **Value** |
|-----------------------------------|-----------|
| **activation_dropout** | `0.0` |
| **activation_function** | `prelu` |
| **attention_dropout** | `0.0` |
| **auxiliary_loss** | `false` |
| **backbone** | `resnet50` |
| **bbox_cost** | `5` |
| **bbox_loss_coefficient** | `5` |
| **class_cost** | `2` |
| **cls_loss_coefficient** | `2` |
| **decoder_attention_heads** | `8` |
| **decoder_ffn_dim** | `2048` |
| **decoder_layers** | `6` |
| **dropout** | `0.1` |
| **encoder_attention_heads** | `8` |
| **encoder_ffn_dim** | `2048` |
| **encoder_layers** | `6` |
| **focal_alpha** | `0.25` |
| **giou_cost** | `2` |
| **giou_loss_coefficient** | `2` |
| **hidden_size** | `256` |
| **init_std** | `0.02` |
| **init_xavier_std** | `1.0` |
| **initializer_bias_prior_prob** | `null` |
| **keep_query_pos** | `false` |
| **normalize_before** | `false` |
| **num_hidden_layers** | `6` |
| **num_patterns** | `0` |
| **num_queries** | `300` |
| **query_dim** | `4` |
| **random_refpoints_xy** | `false` |
| **sine_position_embedding_scale** | `null` |
| **temperature_height** | `20` |
| **temperature_width** | `20` |

## Evaluation

![image/png](https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dab_detr_results.png)

### Model Architecture and Objective

![image/png](https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dab_detr_model_arch.png)

Overview of DAB-DETR. We extract image spatial features using a CNN backbone followed with Transformer encoders to refine the CNN features.
Then dual queries, including positional queries (anchor boxes) and content queries (decoder embeddings), are fed into the decoder to probe the objects which correspond to the anchors and have similar patterns with the content queries. The dual queries are updated layer-by-layer to get close to the target ground-truth objects gradually. The outputs of the final decoder layer are used to predict the objects with labels and boxes by prediction heads, and then a bipartite graph matching is conducted to calculate loss as in DETR.

## Citation

**BibTeX:**

```bibtex
@inproceedings{
  liu2022dabdetr,
  title={{DAB}-{DETR}: Dynamic Anchor Boxes are Better Queries for {DETR}},
  author={Shilong Liu and Feng Li and Hao Zhang and Xiao Yang and Xianbiao Qi and Hang Su and Jun Zhu and Lei Zhang},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=oMI9PjOb9Jl}
}
```

## Model Card Authors

[David Hajdu](https://huggingface.co./davidhajdu)