The main ingredients of the new framework, called Detection Transformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors.
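
To make the matching step concrete, the following is a minimal sketch of bipartite matching between a fixed set of predictions and a smaller set of ground-truth objects. It is not the paper's implementation: the cost (negative class probability plus a weighted L1 box distance), the `box_weight` value, and the names `match_cost` and `bipartite_match` are illustrative assumptions. The optimal assignment is found by brute force over permutations, which is only practical for tiny sets; a real system would use an assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations


def l1_box_distance(a, b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, target, box_weight=5.0):
    """Hypothetical pairwise cost: low when the prediction assigns high
    probability to the target class and its box is close to the target box."""
    class_prob = pred["probs"][target["label"]]
    return -class_prob + box_weight * l1_box_distance(pred["box"], target["box"])


def bipartite_match(preds, targets):
    """Return (pred_index, target_index) pairs with minimum total cost.

    Each target is matched to exactly one prediction and each prediction is
    used at most once. Brute force: assumes len(targets) <= len(preds) and
    that both are small.
    """
    best_cost, best_assignment = float("inf"), ()
    for pred_indices in permutations(range(len(preds)), len(targets)):
        cost = sum(
            match_cost(preds[p], targets[t]) for t, p in enumerate(pred_indices)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, pred_indices
    return sorted((p, t) for t, p in enumerate(best_assignment))


# Three predictions (one per object query), two ground-truth objects.
# Boxes are (center_x, center_y, width, height) in normalized coordinates.
preds = [
    {"probs": [0.1, 0.9], "box": (0.50, 0.50, 0.20, 0.20)},
    {"probs": [0.8, 0.2], "box": (0.20, 0.30, 0.10, 0.10)},
    {"probs": [0.5, 0.5], "box": (0.90, 0.90, 0.05, 0.05)},
]
targets = [
    {"label": 0, "box": (0.21, 0.31, 0.10, 0.10)},
    {"label": 1, "box": (0.52, 0.48, 0.20, 0.20)},
]

matches = bipartite_match(preds, targets)
matched_preds = {p for p, _ in matches}
# Predictions left unmatched would be supervised as "no object".
unmatched = [i for i in range(len(preds)) if i not in matched_preds]
```

Because every ground-truth object claims exactly one prediction, duplicate predictions of the same object cannot all be rewarded, which is how a loss built on such a matching discourages near-identical outputs.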

DETR demonstrates accuracy and run-time performance on par with the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. On that task, we show that it significantly outperforms competitive baselines. The model applies transformers, a neural network architecture that has been highly successful in natural language processing, to computer vision, and to object detection in particular.

Unlike traditional object detection models that rely on complex pipelines (e.g., region proposal networks, non-maximum suppression), DETR simplifies the process with an end-to-end approach built on a transformer encoder-decoder architecture. The encoder processes the input image features, while the decoder predicts the bounding boxes and class labels. DETR is trained with a set-based global loss that enforces a unique, one-to-one (bipartite) matching between predicted and ground-truth objects. To handle the spatial structure of images, DETR adds positional encodings to the input embeddings, enabling the model to capture spatial relationships between objects in the image.
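
As an illustration of adding positional encodings to input embeddings, here is a minimal sketch that builds a fixed 2D sinusoidal encoding for a grid of image features and adds it element-wise. The choice of a sinusoidal (rather than learned) encoding, the `temperature` constant, the split of channels between row and column, and the function names are assumptions made for the example, not details taken from the text above.

```python
import math


def sine_encoding(position, dim, temperature=10000.0):
    """Sinusoidal encoding of a scalar position into `dim` channels
    (interleaved sin/cos pairs; assumes `dim` is even)."""
    enc = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        enc.append(math.sin(position / freq))
        enc.append(math.cos(position / freq))
    return enc


def positional_encoding_2d(row, col, dim):
    """Encode a grid cell: first half of the channels encodes the row,
    second half the column (assumes `dim` is divisible by 4)."""
    half = dim // 2
    return sine_encoding(row, half) + sine_encoding(col, half)


def add_positions(features):
    """Add a positional encoding to every cell of a feature grid.

    `features[row][col]` is a list of floats (one embedding per cell).
    """
    encoded = []
    for r, row in enumerate(features):
        encoded_row = []
        for c, embedding in enumerate(row):
            pos = positional_encoding_2d(r, c, len(embedding))
            encoded_row.append([e + p for e, p in zip(embedding, pos)])
        encoded.append(encoded_row)
    return encoded


# A 2 x 3 grid of 8-dimensional all-zero embeddings, so the output shows
# the positional signal alone.
features = [[[0.0] * 8 for _ in range(3)] for _ in range(2)]
encoded = add_positions(features)
```

Without such a signal the attention layers would treat the flattened feature grid as an unordered set; with it, two cells holding identical features still receive distinct inputs depending on where they sit in the image.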

