Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
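- YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS, and has the highest accuracy (56.8% AP) among known real-time object detectors running at 30 FPS or higher on a V100 GPU
- YOLOv7-E6 (56 FPS on V100, 55.9% AP) outperforms the transformer-based SWIN-L Cascade-Mask R-CNN (9.2 FPS on A100, 53.9% AP) by 509% in speed and 2% in accuracy, and the convolutional-based ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS on A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy
- YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other object detectors in speed and accuracy
- YOLOv7 is trained only on the MS COCO dataset, from scratch, without using any other datasets or pre-trained weights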
Paper Content
Introduction
- Real-time object detection is important in computer vision
- Computing devices used for real-time object detection are mobile CPUs, GPUs, and NPUs
- Real-time object detector proposed in paper supports mobile GPU and GPU devices from edge to cloud
- Edge devices focus on speeding up operations such as convolution, depth-wise convolution, or MLP
- Real-time object detectors for CPU are based on MobileNet, ShuffleNet, or GhostNet
- Real-time object detectors for GPU are based on ResNet, DarkNet, or DLA
- Proposed methods focus on optimization of training process
- Model re-parameterization and dynamic label assignment are important topics in network training and object detection
- Proposed methods address new issues discovered in training of object detector
- Proposed methods reduce about 40% of the parameters and 50% of the computation of the state-of-the-art real-time object detector, with faster inference and higher detection accuracy
Model re-parameterization
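- Two common practices for model-level re-parameterization are training multiple identical models on different training data and averaging their weights, or averaging the weights of one model at different iteration numbers.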
- Module-level re-parameterization splits a module into multiple branches during training and integrates them into a single module during inference.
- New re-parameterization module and application strategies have been developed for various architectures.
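As a minimal sketch of the model-level practice described above, the snippet below averages the weights of several trained models parameter by parameter. The dictionary layout, the parameter names, and the helper average_weights are illustrative assumptions, not code from the paper.

```python
def average_weights(state_dicts):
    """Element-wise mean of several models' weights.

    Each entry of state_dicts is a hypothetical {parameter_name: [floats]} mapping;
    all models are assumed to share the same architecture and parameter names.
    """
    count = len(state_dicts)
    return {
        name: [sum(values) / count for values in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }


# Toy weights standing in for two models trained on different data.
model_a = {"conv.weight": [1.0, 2.0], "conv.bias": [0.5]}
model_b = {"conv.weight": [3.0, 4.0], "conv.bias": [1.5]}

averaged = average_weights([model_a, model_b])
print(averaged)
```

The averaged dictionary describes a single model, so inference cost stays the same as for one of the original models.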
Model scaling
- Model scaling is a way to adjust an existing model to fit different computing devices.
- NAS is a commonly used model scaling method, but it is computationally expensive.
- Most model scaling methods analyze individual scaling factors independently.
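A small worked example, under simplifying assumptions, suggests why scaling factors cannot always be treated independently in a concatenation-based model. Here a block is assumed to concatenate its input with the output of every layer inside it, so adding layers widens the block output and the following transition layer has to be widened by the same ratio. The helper names and channel counts are hypothetical.

```python
def concat_block_out_channels(in_channels, layer_channels, depth):
    """Output width of a block that concatenates its input with every layer's output."""
    return in_channels + layer_channels * depth


def compound_scale(in_channels, layer_channels, depth, transition_channels, depth_factor):
    """Scale depth, then widen the transition layer by the block's output-width ratio."""
    new_depth = round(depth * depth_factor)
    old_out = concat_block_out_channels(in_channels, layer_channels, depth)
    new_out = concat_block_out_channels(in_channels, layer_channels, new_depth)
    width_ratio = new_out / old_out
    new_transition = round(transition_channels * width_ratio)
    return new_depth, new_out, new_transition


# Depth 2 -> 3 grows the concatenated output from 192 to 256 channels,
# so the transition layer is widened by the same ratio (192 -> 256).
scaled = compound_scale(in_channels=64, layer_channels=64, depth=2,
                        transition_channels=192, depth_factor=1.5)
print(scaled)
```

Scaling depth alone would leave the transition layer sized for the old, narrower block output, which is the mismatch a compound scheme is meant to avoid.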
Architecture
Extended efficient layer aggregation networks
- Main considerations in designing efficient architectures are number of parameters, amount of computation, and computational density
- Ma et al. analyzed influence of input/output channel ratio, number of branches, and element-wise operation on network inference speed
- Dollár et al. considered activation when performing model scaling
- CSPVoVNet considers basic designing concerns and gradient path
- ELAN controls the shortest longest gradient path so that a deeper network can learn and converge effectively
- E-ELAN uses expand, shuffle, and merge cardinality to keep enhancing the learning ability of the network without destroying the original gradient path
Model scaling for concatenation-based models
- Model scaling adjusts attributes of a model to meet different inference speeds
- EfficientNet and Scaled-YOLOv4 adjust width, depth, and resolution
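- Dollár et al. analyzed the influence of vanilla convolution and group convolution on the number of parameters and the amount of computation when scaling width and depth
- When depth scaling is performed on a concatenation-based model, the output width of each computational block also changes, which in turn changes the input width of the subsequent transition layer
- Scaling factors therefore cannot be analyzed separately: when the depth of a computational block is scaled, the change in the block's output channels is computed and the transition layers are width-scaled by the same amount of change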
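Planned re-parameterized convolution
- RepConv combines a 3x3 convolution, a 1x1 convolution, and an identity connection in one convolutional layer
- RepConv performs well on plain (VGG-style) networks, but its identity connection hurts accuracy when applied directly to architectures such as ResNet and DenseNet
- RepConvN (RepConv without the identity connection) is used when a convolutional layer with a residual or concatenation connection is replaced by a re-parameterized convolution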
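The branch merging behind RepConv can be sketched in plain Python: a 1x1 kernel and an identity mapping are both special cases of a 3x3 kernel that is non-zero only at its centre, so the three branches sum into one 3x3 kernel. This is a simplified illustration that omits batch normalization; the function names and tensor sizes are assumptions, and passing with_identity=False gives the RepConvN variant.

```python
import itertools
import random


def conv2d_same(x, w):
    """Naive stride-1, zero-padded convolution. x: [C_in][H][W], w: [C_out][C_in][k][k]."""
    c_out, c_in, k = len(w), len(x), len(w[0][0])
    h, wd, pad = len(x[0]), len(x[0][0]), k // 2
    out = [[[0.0] * wd for _ in range(h)] for _ in range(c_out)]
    for o, c, i, j, r, s in itertools.product(
            range(c_out), range(c_in), range(k), range(k), range(h), range(wd)):
        rr, ss = r + i - pad, s + j - pad
        if 0 <= rr < h and 0 <= ss < wd:  # positions outside the image are zero padding
            out[o][r][s] += w[o][c][i][j] * x[c][rr][ss]
    return out


def fuse_branches(w3, w1, with_identity=True):
    """Fold a 1x1 branch (and optionally an identity branch) into the 3x3 kernel."""
    fused = [[[row[:] for row in kernel] for kernel in per_out] for per_out in w3]
    for o in range(len(w3)):
        for c in range(len(w3[0])):
            fused[o][c][1][1] += w1[o][c][0][0]      # 1x1 weight lands on the centre tap
            if with_identity and o == c:
                fused[o][c][1][1] += 1.0             # identity is a centred unit kernel
    return fused


def random_tensor(rng, *shape):
    if len(shape) == 1:
        return [rng.uniform(-1, 1) for _ in range(shape[0])]
    return [random_tensor(rng, *shape[1:]) for _ in range(shape[0])]


rng = random.Random(0)
C, H, W = 3, 5, 5
x = random_tensor(rng, C, H, W)
w3 = random_tensor(rng, C, C, 3, 3)
w1 = random_tensor(rng, C, C, 1, 1)

# Training-time form: three parallel branches. Inference-time form: one fused 3x3 convolution.
y3, y1 = conv2d_same(x, w3), conv2d_same(x, w1)
y_fused = conv2d_same(x, fuse_branches(w3, w1))
max_abs_diff = max(
    abs(y3[c][r][s] + y1[c][r][s] + x[c][r][s] - y_fused[c][r][s])
    for c, r, s in itertools.product(range(C), range(H), range(W)))
print(max_abs_diff)  # close to zero, up to floating-point rounding
```

The identity branch requires equal input and output channel counts, which is why the sketch uses the same C for both.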
Coarse for auxiliary and fine for lead loss
- Deep supervision is a technique used to train deep networks
- It adds extra auxiliary heads in the middle layers of the network
- It can improve the performance of the model on many tasks
- Label assignment usually refers to the ground truth and generates hard labels
- Soft labels are generated by considering the network prediction results and the ground truth
- A new label assignment method is proposed that guides both auxiliary head and lead head by the lead head prediction
- This method generates two different sets of soft labels, coarse label and fine label
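The coarse and fine label idea can be illustrated with a deliberately simplified stand-in for the assigner: candidates are ranked by how well the lead head's predictions match the ground truth, the lead head keeps a small top-k set (fine), and the auxiliary head receives a relaxed, larger set (coarse). The top-k rule, the score values, and all names below are assumptions for illustration and not the paper's actual assignment procedure.

```python
def top_k_soft_labels(match_scores, k):
    """Map the k best-matching candidate indices to their scores, used here as soft labels."""
    ranked = sorted(range(len(match_scores)), key=lambda i: match_scores[i], reverse=True)
    return {i: match_scores[i] for i in ranked[:k]}


def coarse_and_fine_labels(lead_scores, fine_k=2, coarse_k=4):
    """Derive both label sets from the same lead-head matching scores."""
    fine = top_k_soft_labels(lead_scores, fine_k)      # strict set for the lead head
    coarse = top_k_soft_labels(lead_scores, coarse_k)  # relaxed set for the auxiliary head
    return coarse, fine


# Hypothetical matching scores between lead-head predictions and one ground-truth box.
lead_scores = [0.10, 0.82, 0.55, 0.91, 0.30, 0.64]
coarse, fine = coarse_and_fine_labels(lead_scores)
print("fine:", sorted(fine))
print("coarse:", sorted(coarse))
```

Because both sets come from the same ranking, every fine positive is also a coarse positive, so the auxiliary head sees a superset of the lead head's targets.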
Other trainable bag-of-freebies
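- Batch normalization in conv-bn-activation topology: the batch normalization layer is connected directly to the convolutional layer so that its mean and variance can be folded into the convolution's weight and bias at inference
- Implicit knowledge from YOLOR combined with the convolution feature map by addition and multiplication: the implicit knowledge can be pre-computed into a vector at inference and merged into the weight and bias of the adjacent convolutional layer
- EMA model: the exponential moving average (EMA) of the weights is used purely as the final inference model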
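For the EMA item, a minimal sketch of an exponential moving average over model weights is shown below. The decay value, the flat {name: float} weight layout, and the function name are assumptions chosen for readability rather than settings taken from the paper.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Return EMA weights after blending in the current training weights."""
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * model_weights[name]
        for name in ema_weights
    }


# Toy run: the training weight stays at 1.0 and the EMA copy drifts towards it.
ema = {"w": 0.0}
for _ in range(3):
    current = {"w": 1.0}  # stand-in for the weights after one optimizer step
    ema = ema_update(ema, current, decay=0.9)
print(ema)
```

After training, the averaged copy rather than the last raw weights would be the one exported for inference.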
Experiments
Experimental setup
- Used Microsoft COCO dataset to conduct experiments and validate object detection method
- Trained models from scratch
- Used train 2017 set for training, val 2017 set for verification and test 2017 set for performance comparison
- Designed basic models for edge GPU, normal GPU, and cloud GPU, named YOLOv7-tiny, YOLOv7, and YOLOv7-W6 respectively
- Used stack scaling and compound scaling methods to obtain different types of models
- Used leaky ReLU as the activation function for YOLOv7-tiny and SiLU for the other models
- FLOPs are calculated at a rectangular input resolution, and inference time is estimated on input images resized with letterbox resizing
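Letterbox resizing, mentioned in the last item, scales an image to fit a target size while keeping its aspect ratio and pads the remainder. The sketch below only computes the resulting sizes and padding; the 640-pixel target and the centred padding are assumptions chosen for illustration.

```python
def letterbox_shape(height, width, target=640):
    """Return the resized (h, w) plus (top, bottom) and (left, right) padding."""
    scale = min(target / height, target / width)   # the long side ends up equal to target
    new_h, new_w = round(height * scale), round(width * scale)
    pad_h, pad_w = target - new_h, target - new_w
    return (
        (new_h, new_w),
        (pad_h // 2, pad_h - pad_h // 2),
        (pad_w // 2, pad_w - pad_w // 2),
    )


resized, pad_top_bottom, pad_left_right = letterbox_shape(1080, 1920)
print(resized, pad_top_bottom, pad_left_right)
```

Padding only affects the shorter side, so object shapes are not distorted by the resize.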
Baselines
- YOLOv7 has 75% fewer parameters, 36% less computation, and 1.5% higher AP than YOLOv4
- YOLOv7 has 43% fewer parameters, 15% less computation, and 0.4% higher AP than YOLOR-CSP
- Compared with YOLOv4-tiny-31, YOLOv7-tiny reduces the number of parameters by 39% and the amount of computation by 49% while maintaining the same AP
- For the cloud GPU model, the proposed model has 19% fewer parameters and 33% less computation than its baseline while still achieving a higher AP
Comparison with state-of-the-arts
- Proposed method has best speed-accuracy trade-off
- YOLOv7-tiny-SiLU is 127 fps faster and 10.7% more accurate in AP than YOLOv5-N (r6.1)
- YOLOv7 reaches 51.4% AP at 161 fps, while PPYOLOE-L reaches the same AP at only 78 fps
- YOLOv7 has 41% fewer parameters than PPYOLOE-L
- YOLOv7-X is 3.9% more accurate than YOLOv5-L
- YOLOv7-X is 31 fps faster than YOLOv5-X
- YOLOv7-X has 22% fewer parameters and 8% less computation than YOLOv5-X
- YOLOv7-W6 is 8 fps faster and 1% more accurate than YOLOR-P6
- YOLOv7-E6 is 0.9% more accurate than YOLOv5-X6, with 45% fewer parameters and 63% less computation
- YOLOv7-D6 is 0.8% more accurate than YOLOR-E6
- YOLOv7-E6E is 0.3% more accurate than YOLOR-D6
Ablation study
- Compound scaling method improves AP by 0.5% with fewer parameters and less computation
- RepConv improves AP in concatenation-based model
- Reversed dark block improves AP in residual-based model
- Lead guided label assignment improves AP, AP 50 and AP 75
- Partial coarse-to-fine lead guided method has better auxiliary effect
Conclusions
- Proposed a new architecture of realtime object detector and corresponding model scaling method
- Found replacement problem of re-parameterized module and allocation problem of dynamic label assignment
- Proposed trainable bag-of-freebies method to enhance accuracy of object detection
- Developed YOLOv7 series of object detection systems with state-of-the-art results
- YOLOv7 surpasses all known object detectors in speed and accuracy
- YOLOv7-E6 outperforms transformer-based and convolutional-based detectors in speed and accuracy
- Trained YOLOv7 only on MS COCO dataset from scratch
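- The most accurate real-time model, YOLOv7-E6E (56.8% AP), is +13.7% AP higher than the most accurate meituan/YOLOv6-s model (43.1% AP) on the COCO dataset
- YOLOv7-tiny (35.2% AP, 0.4 ms) is +25% faster and +0.2% AP higher than meituan/YOLOv6-n (35.0% AP, 0.5 ms)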
- Ablation studies on proposed model scaling, planned RepConcatenation model, planned RepResidual model, auxiliary head, constrained auxiliary head, and partial auxiliary head
- Comparison of baseline object detectors, state-of-the-art real-time object detectors, and different settings