Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Presents a fast and accurate object detection method called DAMO-YOLO
  • Uses Neural Architecture Search, Reparameterized Generalized-FPN, AlignedOTA label assignment, and distillation enhancement
  • Searches for a detection backbone with low latency and high performance
  • Follows the rule of “large neck, small head”
  • Investigates how detector head size affects detection performance
  • Achieves 43.0/46.8/50.0 mAPs on COCO with the latency of 2.78/3.83/5.62 ms on T4 GPUs

Paper Content

Introduction

  • Researchers have developed object detection methods with great progress
  • Network structure plays a critical role in object detection, and NAS techs have been used to find efficient network structures
  • Dynamic label assignment and Knowledge Distillation have been used to improve performance of object detection

Damo-yolo

  • Introduces each module of DAMO-YOLO
  • Includes Neural Architecture Search (NAS) backbones, efficient Reparameterized Generalized-FPN (RepGFPN) neck, ZeroHead, AlignedOTA label assignment and distillation enhancement
  • Whole framework of DAMO-YOLO is displayed in Fig. 2

Mae-nas backbone

  • MAE-NAS is used to obtain optimal networks under different computational budgets.
  • Search process only takes a few hours.
  • Backbones designed in the vanilla convolution network space with a new search block “k1kx”.
  • GPU inference latency, not FLOPs, used as the target budget.
  • SPP, Focus and CSP modules applied to the final backbones.

Efficient repgfpn

  • Feature Pyramid Network (FPN) is an effective part of object detection
  • Conventional FPN has a top-down pathway to fuse multi-scale features
  • PAFPN adds a bottom-up path aggregation network, but with higher computational costs
  • BiFPN removes nodes with one input edge and adds skiplink from the original input
  • GFPN is proposed to serve as neck and achieves SOTA performance by exchanging high-level semantic and low-level spatial information

Zerohead and alignota

  • Decoupled head is widely used in object detection
  • Label assignment is a crucial component during detector training
  • Dynamic label assignment methods assign labels according to the assignment cost between prediction and ground truth
  • Misalignment of classification and regression is a common issue in static assignment methods
  • Focal loss is introduced into the classification cost and IoU of prediction and ground truth box is used as the soft label
  • AlignOTA outperforms all other label assignment methods

Distillation enhancement

  • Knowledge Distillation (KD) is an effective method to boost the performance of pocket-size models
  • KD can be difficult to optimize and features can carry too much noise
  • DAMO-YOLO uses feature-based distillation to transfer dark knowledge
  • Experiments show CWD is more suitable for DAMO-YOLO
  • Distillation is split into two stages
  • Align Module and Channel-wise Dynamic Temperature are used to enhance distillation
  • Balance between distillation and task loss is necessary
  • Shallow head of detector is beneficial to feature distillation

Implementation details

  • Trained 300 epochs with SGD optimizer
  • Weight decay and SGD momentum of 5e-4 and 0.9
  • Initial learning rate of 0.4 with batch size of 256
  • Learning rate decays according to cosine schedule
  • Utilized Mosaic and Mixup for image-level augmentation and SADA for box-level augmentation

Comparison with the sota

  • DAMO-YOLO family outperforms all YOLO series in accuracy and speed
  • Method can detect objects effectively and efficiently

Conclusion