Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed a new framework for object detection called DiffusionDet
  • DiffusionDet formulates object detection as a denoising diffusion process from noisy boxes to object boxes
  • During training, object boxes diffuse from ground-truth boxes to random distribution
  • Model learns to reverse this noising process during inference
  • Evaluations on MS-COCO and LVIS show favorable performance compared to previous detectors
  • Random boxes are effective object candidates
  • Object detection can be solved by a generative way

Paper Content

Introduction

  • We propose DiffusionDet, a novel noise-to-box object detection framework, which decouples the training and evaluation and enables progressive refinement.
  • We evaluate DiffusionDet on MS-COCO and LVIS datasets, and it achieves competitive performance compared to existing approaches.
  • We demonstrate the effectiveness of DiffusionDet in various settings, such as different backbones, different sampling steps, and different numbers of random boxes.
  • We release the source code of DiffusionDet for reproducibility.
  • Object detection aims to predict bounding boxes and category labels
  • Used in many related recognition scenarios
  • Modern approaches use empirically designed object candidates
  • Learnable object queries have been proposed
  • Question: simpler approach without learnable queries?
  • Noise-to-box approach proposed - no hand-designed components
  • DiffusionDet proposed - generative denoising process
  • Decouples training and evaluation, enables progressive refinement
  • Evaluated on MS-COCO and LVIS datasets
  • Source code released
  • Object detection is usually done using box regression and category classification on empirical object priors.
  • DETR [10] proposed a query-based detection paradigm.
  • Diffusion models are a class of deep generative models.
  • Diffusion models have been used for image generation, segmentation, and other tasks.
  • This is the first work to use a diffusion model for object detection.

Preliminaries

  • Object detection is a learning objective that takes an input image and produces a set of bounding boxes and category labels.
  • Diffusion models are a class of likelihood-based models inspired by nonequilibrium thermodynamics.
  • A neural network is trained to predict the bounding boxes from noisy boxes, conditioned on the corresponding image.

Architecture

  • Diffusion model generates data samples iteratively
  • Computationally intractable to directly apply model to raw image
  • Model separated into two parts: image encoder and detection decoder
  • Image encoder takes raw image and extracts high-level features
  • Detection decoder takes set of proposal boxes and sends to detection head
  • DiffusionDet begins from random boxes, re-uses detector head in iterative steps

Training

  • Construct diffusion process from ground-truth boxes to noisy boxes
  • Pad ground truth boxes to fixed number
  • Explore padding strategies (e.g. repeating, concatenating)
  • Add Gaussian noises to padded boxes
  • Use monotonically decreasing cosine schedule for noise scale
  • Use set prediction loss on set of predictions
  • Assign multiple predictions to each ground truth

Inference

  • DiffusionDet is a denoising sampling process from noise to object boxes.
  • In each sampling step, boxes are sent into the detection decoder to predict category classification and box coordinates.
  • Box renewal strategy is used to filter out undesired boxes and replace them with random boxes.
  • DiffusionDet can be evaluated with an arbitrary number of random boxes and sampling steps.

Experiments

  • DiffusionDet has a property called “once-for-all”
  • DiffusionDet is compared to other detectors on MS-COCO and LVIS datasets
  • MS-COCO dataset has 118K training images and 5K validation images, with 80 object categories
  • Evaluation metrics for MS-COCO are box average precision over multiple IoU thresholds (AP), threshold 0.5 (AP 50 ) and 0.75 (AP 75 )
  • LVIS dataset has 100K training images and 20K validation images, with 1203 categories
  • Evaluation metrics for LVIS are MS-COCO style box metric AP, AP 50 and AP 75

Implementation details.

  • ResNet and Swin backbone are initialized with pre-trained weights on ImageNet-1K and ImageNet-21K
  • Detection decoder is initialized with Xavier init
  • AdamW optimizer with initial learning rate of 2.5 × 10 −5 and weight decay of 10 −4
  • Training schedule for MS-COCO is 450K iterations, learning rate divided by 10 at 350K and 420K iterations
  • Training schedule for LVIS is 210K, 250K, 270K
  • Data augmentation includes random horizontal flip, scale jitter, and random crop
  • At inference stage, top-100 and top-300 scoring predictions for MS-COCO and LVIS, respectively, are selected and ensembled together by NMS

Main properties

  • DiffusionDet can be used with changing the number of boxes and number of sample steps in inference.
  • DiffusionDet can achieve better accuracy with more boxes or/and more refining steps.
  • DiffusionDet is compared with DETR to show the advantage of dynamic boxes.
  • Performance of DiffusionDet increases steadily with the number of boxes used for evaluation.
  • DETR has a clear performance drop when the number of boxes is different from the number of queries.
  • DiffusionDet can be improved by increasing the number of random boxes or the iterative steps.

Benchmarking on detection datasets

  • DiffusionDet is compared to 6 previous detectors
  • Results are compared on a challenging LVIS dataset
  • Reproduced Faster R-CNN and Cascade R-CNN using default settings of detectron2
  • Sparse R-CNN on its original code
  • Federated loss used to boost performance
  • DiffusionDet attains remarkable gains with more refinement steps
  • Refinement brings more gains on LVIS than MS-COCO

Ablation study

  • Experiments conducted on MS-COCO to study DiffusionDet
  • ResNet-50 with FPN used as backbone
  • Signal scaling factor of 2.0 achieves optimal AP performance
  • GT boxes padding strategy studied using different methods
  • Sampling strategy studied using different methods
  • Box renewal threshold studied using different methods
  • Matching between N train and N eval studied
  • Accuracy vs. speed studied
  • Random seed studied for stability

Conclusion and future work

  • Proposed a novel detection paradigm, DiffusionDet, by viewing object detection as a denoising diffusion process from noisy boxes to object boxes
  • Dynamic box and progressive refinement, enabling same network parameters to obtain desired speed-accuracy trade-off without re-training
  • Experiments on standard detection benchmarks show favorable performance compared to well-established detectors
  • Future works include applying DiffusionDet to video-level tasks, such as object tracking and action recognition, and extending DiffusionDet from close-world to open-world or open-vocabulary object detection
  • DiffusionDet has dynamic box property, progressive refinement property, and is able to benefit from more proposal boxes and iterative refinements using the same network parameters
  • DiffusionDet has negligible performance drop with more refinement steps, and even has performance gains with more refinement steps
  • DiffusionDet is optimized with a multi-task loss function, which includes focal loss, L1 loss, and GIoU loss
  • Visualize sampling step of DiffusionDet with 300 boxes, 50 boxes are drawn in the image