Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • YOLOv7 surpasses all known object detectors in both speed and accuracy
  • YOLOv7-E6 outperforms other detectors in speed and accuracy
  • YOLOv7 outperforms other object detectors in speed and accuracy
  • YOLOv7 is trained on MS COCO dataset from scratch without using any other datasets or pre-trained weights

Paper Content

Introduction

  • Real-time object detection is important in computer vision
  • Computing devices used for real-time object detection are mobile CPUs, GPUs, and NPUs
  • Real-time object detector proposed in paper supports mobile GPU and GPU devices from edge to cloud
  • Edge devices focus on speeding up operations such as convolution, depth-wise convolution, or MLP
  • Real-time object detectors for CPU are based on MobileNet, ShuffleNet, or GhostNet
  • Real-time object detectors for GPU are based on ResNet, DarkNet, or DLA
  • Proposed methods focus on optimization of training process
  • Model re-parameterization and dynamic label assignment are important topics in network training and object detection
  • Proposed methods address new issues discovered in training of object detector
  • Proposed methods reduce parameters and computation of state-of-the-art real-time object detector

Model re-parameterization

  • Common practices for model-level reparameterization involve training multiple models and averaging their weights.
  • Module-level re-parameterization splits a module into multiple branches during training and integrates them into a single module during inference.
  • New re-parameterization module and application strategies have been developed for various architectures.

Model scaling

  • Model scaling is a way to adjust an existing model to fit different computing devices.
  • NAS is a commonly used model scaling method, but it is computationally expensive.
  • Most model scaling methods analyze individual scaling factors independently.

Architecture

Extended efficient layer aggregation networks

  • Main considerations in designing efficient architectures are number of parameters, amount of computation, and computational density
  • Ma et al. analyzed influence of input/output channel ratio, number of branches, and element-wise operation on network inference speed
  • Dollár et al. considered activation when performing model scaling
  • CSPVoVNet considers basic designing concerns and gradient path
  • ELAN considers design strategy of controlling shortest longest gradient path
  • E-ELAN uses expand, shuffle, merge cardinality to enhance learning ability without destroying gradient path

Model scaling for concatenation-based models

  • Model scaling adjusts attributes of a model to meet different inference speeds
  • EfficientNet and Scaled-YOLOv4 adjust width, depth, and resolution
  • Dollár et al. analyzed influence of convolution on amount of parameter and computation
  • When depth scaling is performed on concatenation-based models, output width of computational block increases
  • When scaling model on concatenation-based model, depth is scaled and width is scaled with same amount of change
  • RepConv combines 3x3 convolution, 1x1 convolution, and identity connection in one convolutional layer
  • RepConvN (without identity connection) is used in PlainNet and ResNet

Coarse for auxiliary and fine for lead loss

  • Deep supervision is a technique used to train deep networks
  • It adds extra auxiliary heads in the middle layers of the network
  • It can improve the performance of the model on many tasks
  • Label assignment usually refers to the ground truth and generates hard labels
  • Soft labels are generated by considering the network prediction results and the ground truth
  • A new label assignment method is proposed that guides both auxiliary head and lead head by the lead head prediction
  • This method generates two different sets of soft labels, coarse label and fine label

Other trainable bag-of-freebies

  • Batch normalization in conv-bn-activation topology
  • Implicit knowledge in YOLOR combined with convolution feature map in addition and multiplication manner
  • EMA model used as final inference model

Experiments

Experimental setup

  • Used Microsoft COCO dataset to conduct experiments and validate object detection method
  • Trained models from scratch
  • Used train 2017 set for training, val 2017 set for verification and test 2017 set for performance comparison
  • Designed basic models for edge GPU, normal GPU, and cloud GPU
  • Used stack scaling and compound scaling methods to obtain different types of models
  • Used leaky ReLU and SiLU as activation functions
  • FLOPs calculated by rectangle input resolution and inference time estimated by letterbox resize input image

Baselines

  • YOLOv7 has 75% fewer parameters, 36% less computation, and 1.5% higher AP than YOLOv4
  • YOLOv7 has 43% fewer parameters, 15% less computation, and 0.4% higher AP than YOLOR-CSP
  • YOLOv7tiny reduces the number of parameters by 39% and the amount of computation by 49%, but maintains the same AP
  • YOLOv7 has 19% fewer parameters and 33% less computation, but still has a higher AP than the cloud GPU model

Comparison with state-of-the-arts

  • Proposed method has best speed-accuracy trade-off
  • 127 fps faster and 10.7% more accurate than YOLOv5-N
  • YOLOv7 has 51.4% AP at 161 fps, PPYOLOE-L has 78 fps
  • YOLOv7 has 41% less parameters than PPYOLOE-L
  • YOLOv7-X is 3.9% more accurate than YOLOv5-L
  • YOLOv7-X is 31 fps faster than YOLOv5-X
  • YOLOv7-X has 22% less parameters and 8% less computation than YOLOv5-X
  • YOLOv7-W6 is 8 fps faster and 1% more accurate than YOLOR-P6
  • YOLOv7-E6 is 0.9% more accurate, 45% less parameters and 63% less computation than YOLOv5-X6
  • YOLOv7-D6 is 0.8% more accurate than YOLOR-E6
  • YOLOv7-E6E is 0.3% more accurate than YOLOR-D6
  • Compound scaling method improves AP by 0.5% with less parameters and computation
  • RepConv improves AP in concatenation-based model
  • Reversed dark block improves AP in residual-based model
  • Lead guided label assignment improves AP, AP 50 and AP 75
  • Partial coarse-to-fine lead guided method has better auxiliary effect

Conclusions

  • Proposed a new architecture of realtime object detector and corresponding model scaling method
  • Found replacement problem of re-parameterized module and allocation problem of dynamic label assignment
  • Proposed trainable bag-of-freebies method to enhance accuracy of object detection
  • Developed YOLOv7 series of object detection systems with state-of-the-art results
  • YOLOv7 surpasses all known object detectors in speed and accuracy
  • YOLOv7-E6 outperforms transformer-based and convolutional-based detectors in speed and accuracy
  • Trained YOLOv7 only on MS COCO dataset from scratch
  • Maximum accuracy of YOLOv7-E6 is +13.7% AP higher than current most accurate model
  • YOLOv7-tiny is +25% faster and +0.2% AP higher than other model
  • Ablation studies on proposed model scaling, planned RepConcatenation model, planned RepResidual model, auxiliary head, constrained auxiliary head, and partial auxiliary head
  • Comparison of baseline object detectors, state-of-the-art real-time object detectors, and different settings