Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Presents a fast and accurate object detection method called DAMO-YOLO
- Uses Neural Architecture Search, Reparameterized Generalized-FPN, AlignedOTA label assignment, and distillation enhancement
- Searches for a detection backbone with low latency and high performance
- Follows the rule of “large neck, small head”
- Investigates how detector head size affects detection performance
- Achieves 43.0/46.8/50.0 mAP on COCO at latencies of 2.78/3.83/5.62 ms on T4 GPUs, respectively
Paper Content
Introduction
- Object detection methods have made great progress in recent years
- Network structure plays a critical role in object detection, and NAS techniques have been used to find efficient network structures
- Dynamic label assignment and Knowledge Distillation have been used to improve performance of object detection
DAMO-YOLO
- Introduces each module of DAMO-YOLO
- Includes Neural Architecture Search (NAS) backbones, efficient Reparameterized Generalized-FPN (RepGFPN) neck, ZeroHead, AlignedOTA label assignment and distillation enhancement
- Whole framework of DAMO-YOLO is displayed in Fig. 2
MAE-NAS backbone
- MAE-NAS is used to obtain optimal networks under different computational budgets.
- Search process only takes a few hours.
- Backbones designed in the vanilla convolution network space with a new search block “k1kx”.
- GPU inference latency, not FLOPs, used as the target budget.
- SPP, Focus and CSP modules applied to the final backbones.
Efficient RepGFPN
- Feature Pyramid Network (FPN) is an effective part of object detection
- Conventional FPN has a top-down pathway to fuse multi-scale features
- PAFPN adds a bottom-up path aggregation network, but with higher computational costs
- BiFPN removes nodes that have only one input edge and adds a skip link from the original input
- GFPN is proposed to serve as neck and achieves SOTA performance by exchanging high-level semantic and low-level spatial information
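
The top-down pathway of a conventional FPN, mentioned in the bullets above, can be sketched in plain Python. This is only an illustration and not DAMO-YOLO's RepGFPN: the lateral and post-fusion convolutions are omitted, feature maps are single-channel nested lists, and the helper names `upsample2x` and `top_down_fuse` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_fuse(levels):
    """Fuse a feature pyramid from coarse to fine.

    `levels` is ordered fine -> coarse (e.g. [P3, P4, P5]); each level is
    assumed to have half the resolution of the previous one.
    Returns the fused pyramid in the same order.
    """
    fused = [None] * len(levels)
    fused[-1] = [list(r) for r in levels[-1]]      # coarsest level is kept as-is
    for i in range(len(levels) - 2, -1, -1):
        up = upsample2x(fused[i + 1])              # bring the coarser map to this size
        fused[i] = [[a + b for a, b in zip(ra, rb)]
                    for ra, rb in zip(levels[i], up)]
    return fused


p3 = [[1.0] * 4 for _ in range(4)]  # 4x4, finest level
p4 = [[2.0] * 2 for _ in range(2)]  # 2x2
p5 = [[3.0]]                        # 1x1, coarsest level
out = top_down_fuse([p3, p4, p5])
```

Semantic information from the coarse levels flows into the fine levels here; PAFPN, BiFPN and GFPN differ in which additional paths they add on top of this basic pattern.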
ZeroHead and AlignedOTA
- Decoupled head is widely used in object detection
- Label assignment is a crucial component during detector training
- Dynamic label assignment methods assign labels according to the assignment cost between prediction and ground truth
- Misalignment of classification and regression is a common issue in static assignment methods
- Focal loss is introduced into the classification cost and IoU of prediction and ground truth box is used as the soft label
- AlignedOTA outperforms the other label assignment methods compared in the paper
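
The assignment cost described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the cost is the sum of a focal-style classification term that uses the IoU as a soft label and a regression term of -log(IoU), and that boxes are in (x1, y1, x2, y2) format. The function names `iou` and `assignment_cost` are hypothetical, and the exact weighting should be checked against the paper.

```python
import math


def iou(box_a, box_b):
    """IoU of two non-degenerate boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def assignment_cost(cls_prob, pred_box, gt_box, eps=1e-7):
    """Cost of assigning one prediction to one ground-truth box (assumed form)."""
    alpha = iou(pred_box, gt_box)                 # IoU doubles as the soft label
    p = min(max(cls_prob, eps), 1.0 - eps)        # clamp to keep log() finite
    # Cross-entropy between the predicted score and the soft label alpha
    ce = -(alpha * math.log(p) + (1.0 - alpha) * math.log(1.0 - p))
    cls_cost = abs(alpha - cls_prob) ** 2 * ce    # focal-style modulation
    reg_cost = -math.log(max(alpha, eps))         # low IoU -> high cost
    return cls_cost + reg_cost


gt = (0.0, 0.0, 10.0, 10.0)
good = assignment_cost(0.9, (0.0, 0.0, 10.0, 10.0), gt)
bad = assignment_cost(0.9, (5.0, 5.0, 15.0, 15.0), gt)
```

Because the same IoU appears in both terms, a prediction gets a low cost only when its classification score and its box quality agree, which is the alignment the section refers to.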
Distillation enhancement
- Knowledge Distillation (KD) is an effective method to boost the performance of pocket-size models
- KD can be difficult to optimize and features can carry too much noise
- DAMO-YOLO uses feature-based distillation to transfer dark knowledge
- Experiments show CWD is more suitable for DAMO-YOLO
- Distillation is split into two stages
- Align Module and Channel-wise Dynamic Temperature are used to enhance distillation
- Balance between distillation and task loss is necessary
- Shallow head of detector is beneficial to feature distillation
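
As a rough illustration of the channel-wise distillation (CWD) idea, the sketch below turns each channel of a feature map into a probability distribution over spatial positions and penalises the KL divergence between teacher and student. It assumes a fixed temperature and already-matched feature shapes, so the Align Module and the channel-wise dynamic temperature mentioned above are not modelled; `spatial_softmax` and `cwd_loss` are hypothetical names.

```python
import math


def spatial_softmax(channel, temperature):
    """Softmax over the spatial positions of one channel (flat list)."""
    scaled = [v / temperature for v in channel]
    m = max(scaled)                                # subtract max for stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cwd_loss(student, teacher, temperature=1.0, eps=1e-12):
    """Channel-wise KL(teacher || student), averaged over channels.

    `student` and `teacher` are lists of channels; each channel is a flat
    list of H*W activations. Shapes are assumed to match.
    """
    loss = 0.0
    for s_ch, t_ch in zip(student, teacher):
        p_s = spatial_softmax(s_ch, temperature)
        p_t = spatial_softmax(t_ch, temperature)
        for pt, ps in zip(p_t, p_s):
            if pt > 0.0:                           # 0 * log(0) is treated as 0
                loss += pt * (math.log(pt) - math.log(max(ps, eps)))
    return (temperature ** 2) * loss / len(student)


teacher_feat = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 2.0, 0.1]]
student_feat = [[1.0, 1.0, 1.0, 1.0], [0.1, 2.0, 0.5, 0.5]]
same = cwd_loss(teacher_feat, teacher_feat)
diff = cwd_loss(student_feat, teacher_feat, temperature=2.0)
```

In training, a term like this would be weighted and added to the detection loss, which is where the balance between distillation and task loss comes in.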
Implementation details
- Trained for 300 epochs with the SGD optimizer
- Weight decay of 5e-4 and SGD momentum of 0.9
- Initial learning rate of 0.4 with batch size of 256
- Learning rate decays according to cosine schedule
- Utilized Mosaic and Mixup for image-level augmentation and SADA for box-level augmentation
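
The cosine learning-rate decay can be sketched with the values listed above (initial rate 0.4, 300 epochs). The sketch assumes no warm-up, a final learning rate of 0, and one update per epoch; none of these details are stated in this summary, and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(epoch, total_epochs=300, base_lr=0.4, min_lr=0.0):
    """Learning rate at `epoch` under a half-cosine decay (assumed form)."""
    progress = epoch / total_epochs               # 0.0 at start, 1.0 at end
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint and end of training
schedule = [cosine_lr(e) for e in (0, 150, 300)]
```

The rate starts at the base value, passes through half of it at the midpoint, and decays smoothly to the minimum at the final epoch.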
Comparison with the SOTA
- The DAMO-YOLO family outperforms the YOLO-series detectors compared in the paper in both accuracy and speed
- Method can detect objects effectively and efficiently