Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

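YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS
YOLOv7-E6 outperforms both transformer-based and convolutional-based detectors in speed and accuracy
YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, and many other object detectors in speed and accuracy
YOLOv7 is trained only on the MS COCO dataset from scratch, without using any other datasets or pre-trained weights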

Paper Content

Introduction

Real-time object detection is an important topic in computer vision
Real-time object detection usually runs on mobile CPUs or GPUs, or on neural processing units (NPUs)
The real-time object detector proposed in the paper targets mobile GPUs and GPU devices from the edge to the cloud
Edge devices focus on speeding up operations such as convolution, depth-wise convolution, or MLP
Real-time object detectors for CPU are mostly based on MobileNet, ShuffleNet, or GhostNet
Real-time object detectors for GPU are mostly based on ResNet, DarkNet, or DLA
The proposed methods focus on optimizing the training process rather than only the architecture
Model re-parameterization and dynamic label assignment have become important topics in network training and object detection
The proposed methods address new issues discovered in the training of object detectors
The proposed methods reduce about 40% of the parameters and 50% of the computation of a state-of-the-art real-time object detector

Model re-parameterization

Common practices for model-level re-parameterization are to train multiple identical models on different training data and average their weights, or to average the weights of one model at different iterations.
Module-level re-parameterization splits a module into multiple branches during training and integrates them into a single, equivalent module during inference.
The paper develops a new re-parameterization module and designs related application strategies for various architectures.
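
To make the module-level idea concrete, the following is a minimal sketch of merging three training-time branches (a 3x3 convolution, a 1x1 convolution, and an identity connection) into one 3x3 kernel for inference. It assumes a single channel, stride 1, zero padding, and no batch normalization; the helper names `conv2d_same` and `merge_branches` are hypothetical and are not taken from the paper's code.

```python
def conv2d_same(image, kernel):
    """Single-channel 2-D convolution (cross-correlation) with zero padding."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += kernel[dy][dx] * image[yy][xx]
    return out


def merge_branches(k3, k1, use_identity=True):
    """Fold a 1x1 kernel and an optional identity branch into a 3x3 kernel."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1[0][0]      # a 1x1 kernel is a 3x3 kernel with only a centre tap
    if use_identity:
        merged[1][1] += 1.0       # identity is a 3x3 kernel with 1 at the centre
    return merged


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.2, 0.3], [0.0, -0.5, 0.4], [0.2, 0.1, -0.3]]
k1 = [[0.7]]

# Training-time view: three branches evaluated separately and summed.
b3 = conv2d_same(image, k3)
b1 = conv2d_same(image, k1)
train_out = [[b3[y][x] + b1[y][x] + image[y][x] for x in range(3)] for y in range(3)]

# Inference-time view: one merged 3x3 convolution.
merged = merge_branches(k3, k1)
infer_out = conv2d_same(image, merged)

max_diff = max(abs(train_out[y][x] - infer_out[y][x]) for y in range(3) for x in range(3))
print(max_diff)
```

Because convolution is linear, the two views agree up to floating-point rounding. Passing `use_identity=False` gives the variant without the identity branch.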

Model scaling

Model scaling is a way to adjust an existing model to fit different computing devices.
Network architecture search (NAS) is a commonly used model scaling method, but searching for the scaling factors is computationally expensive.
Most model scaling methods analyze individual scaling factors independently.

Architecture

Extended efficient layer aggregation networks

The main considerations in designing efficient architectures are the number of parameters, the amount of computation, and the computational density
Ma et al. analyzed the influence of the input/output channel ratio, the number of branches, and element-wise operations on network inference speed
Dollár et al. additionally considered activation, that is, the number of elements in the output tensors of convolutional layers, when performing model scaling
CSPVoVNet considers the basic design concerns above and also analyzes the gradient path
ELAN follows the design strategy of controlling the shortest longest gradient path so that a deeper network can learn and converge effectively
E-ELAN uses expand, shuffle, and merge cardinality to enhance the learning ability of the network without destroying the original gradient path

Model scaling for concatenation-based models

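Model scaling adjusts attributes of a model to generate models of different scales that meet different inference speed requirements
EfficientNet scales width, depth, and resolution; Scaled-YOLOv4 adjusts the number of stages
Dollár et al. analyzed the influence of vanilla convolution and group convolution on the number of parameters and the amount of computation when scaling width and depth
When depth scaling is performed on a concatenation-based model, the output width of the computational block also increases
The proposed compound scaling therefore scales the depth of a computational block, computes the resulting change in its output channels, and scales the width of the transition layers by the same amount
RepConv combines a 3x3 convolution, a 1x1 convolution, and an identity connection in one convolutional layer
RepConvN (RepConv without the identity connection) is used in the planned re-parameterized PlainNet and ResNet designs, because the identity connection in RepConv destroys the residual in ResNet and the concatenation in DenseNet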
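
The coupling between depth and width in a concatenation-based block can be illustrated with a small calculation. This sketch assumes a simplified block whose output is the concatenation of `num_concat` branches with `branch_channels` channels each, followed by a transition layer; the function name, its arguments, and the rounding rule are hypothetical and are not taken from the paper's code.

```python
def compound_scale(branch_channels, num_concat, transition_channels, depth_factor):
    """Scale block depth, then scale the transition width by the same ratio.

    Returns (scaled number of concatenated branches,
             new block output channels,
             new transition layer channels).
    """
    # Depth scaling: more layers feed the concatenation (simplifying assumption).
    scaled_concat = max(1, round(num_concat * depth_factor))

    # The block output width changes as a side effect of depth scaling.
    old_out = branch_channels * num_concat
    new_out = branch_channels * scaled_concat

    # Width scaling of the transition layer by the same amount of change.
    width_ratio = new_out / old_out
    new_transition = round(transition_channels * width_ratio)
    return scaled_concat, new_out, new_transition


print(compound_scale(64, 4, 256, 1.5))
```

With a depth factor of 1.5, the four concatenated branches become six, the block output grows from 256 to 384 channels, and the transition layer is widened by the same ratio so that it still matches its input.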

Coarse for auxiliary and fine for lead loss

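Deep supervision is a technique often used to train deep networks
Deep supervision adds extra auxiliary heads in the middle layers of the network; the head responsible for the final output is called the lead head
Deep supervision can improve the performance of the model on many tasks
In the past, label assignment referred directly to the ground truth and generated hard labels according to given rules
More recent methods generate soft labels by considering both the network prediction results and the ground truth
A new label assignment method is proposed that guides both the auxiliary head and the lead head by the lead head prediction
The coarse-to-fine variant of this method generates two different sets of soft labels: a fine label for the lead head, and a coarse label for the auxiliary head that relaxes the positive sample constraints so that more grids are treated as positive targets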
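
The relationship between coarse and fine labels can be sketched as follows. This is a deliberately simplified illustration, not the paper's assigner: it assumes each grid cell already has a matching score between the lead head prediction and the ground truth (for example an IoU), and it models "relaxing the constraints" as keeping more top-scoring cells. The function names and the `fine_k` / `coarse_k` parameters are hypothetical.

```python
def soft_labels(lead_scores, top_k):
    """Keep the top_k highest-scoring cells as positives with soft targets."""
    ranked = sorted(range(len(lead_scores)), key=lambda i: lead_scores[i], reverse=True)
    positives = set(ranked[:top_k])
    return [score if i in positives else 0.0 for i, score in enumerate(lead_scores)]


def coarse_and_fine(lead_scores, fine_k=2, coarse_k=4):
    """Derive both label sets from the same lead head scores."""
    fine = soft_labels(lead_scores, fine_k)      # strict: trains the lead head
    coarse = soft_labels(lead_scores, coarse_k)  # relaxed: trains the auxiliary head
    return coarse, fine


scores = [0.1, 0.8, 0.6, 0.3, 0.5]   # per-cell lead head score vs. ground truth
coarse, fine = coarse_and_fine(scores)
print(fine)
print(coarse)
```

In this sketch every positive in the fine label is also a positive in the coarse label, so the auxiliary head is trained on a superset of the lead head's targets.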

Other trainable bag-of-freebies

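Batch normalization in conv-bn-activation topology: the batch normalization layer is connected directly to the convolutional layer so that its mean and variance can be integrated into the bias and weight of the convolution at inference
Implicit knowledge in YOLOR combined with the convolution feature map in addition and multiplication manner: the implicit knowledge can be pre-computed into a vector at inference and combined with the bias and weight of the adjacent convolutional layer
EMA model: the exponential moving average (EMA) model is used purely as the final inference model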
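
Two of the items above reduce to short formulas, sketched here for a single output channel with scalar values. Folding batch normalization uses the standard identity gamma * (w*x + b - mean) / sqrt(var + eps) + beta = w'*x + b'; the EMA update is the usual exponential moving average of weights. The function names and the `decay` default are assumptions for illustration, not values from the paper.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN statistics of one output channel into conv weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


def ema_update(ema, current, decay=0.999):
    """Return the exponential moving average of two weight dictionaries."""
    return {name: decay * ema[name] + (1.0 - decay) * current[name] for name in ema}


# Check that conv followed by BN equals the folded conv on one input value.
w, b = 0.5, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0
x = 2.0
unfolded = gamma * ((w * x + b) - mean) / math.sqrt(var + 1e-5) + beta
w_fold, b_fold = fold_batchnorm(w, b, gamma, beta, mean, var)
folded = w_fold * x + b_fold
print(abs(unfolded - folded))

# One EMA step with a deliberately small decay so the change is visible.
ema_weights = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
print(ema_weights)
```

In a real network the same folding is applied per output channel to every kernel weight of that channel, and the EMA copy is updated after each optimizer step.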

Experiments

Experimental setup

Used Microsoft COCO dataset to conduct experiments and validate object detection method
Trained models from scratch
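Used the train 2017 set for training, the val 2017 set for verification and choosing hyperparameters, and the test 2017 set for performance comparison with the state of the art
Designed basic models for edge GPU, normal GPU, and cloud GPU, called YOLOv7-tiny, YOLOv7, and YOLOv7-W6 respectively
Used stack scaling on the neck and the proposed compound scaling method to obtain the other models: YOLOv7-X from YOLOv7, and YOLOv7-E6 and YOLOv7-D6 from YOLOv7-W6
Used leaky ReLU as the activation function for YOLOv7-tiny and SiLU for the other models
FLOPs are calculated at a rectangular input resolution, and inference time is estimated on letterbox-resized input images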
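
Letterbox resizing keeps the aspect ratio of the image and pads the shorter side. The sketch below only computes shapes, under assumptions that are not stated in the summary: a 640-pixel target, a network stride of 32, and padding to the nearest stride multiple rather than to a full square (a common convention in YOLO-family code). The function name `letterbox_shape` is hypothetical.

```python
def letterbox_shape(height, width, target=640, stride=32):
    """Return (resized shape, padded shape) for a letterbox resize.

    The image is scaled so its longer side fits `target`, then padded
    with the minimum border that makes both sides multiples of `stride`.
    """
    scale = min(target / height, target / width)
    new_h, new_w = round(height * scale), round(width * scale)

    # Minimal padding: target is a multiple of stride, so taking the gap
    # to target modulo stride lands on the nearest stride multiple.
    pad_h = (target - new_h) % stride
    pad_w = (target - new_w) % stride
    return (new_h, new_w), (new_h + pad_h, new_w + pad_w)


print(letterbox_shape(720, 1280))
print(letterbox_shape(480, 640))
```

A 720x1280 frame is resized to 360x640 and padded to 384x640, so the network processes fewer pixels than a square 640x640 input would require, which is why the input shape matters when comparing inference times.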

Baselines

YOLOv7 has 75% fewer parameters, 36% less computation, and 1.5% higher AP than YOLOv4
YOLOv7 has 43% fewer parameters, 15% less computation, and 0.4% higher AP than YOLOR-CSP
YOLOv7-tiny reduces the number of parameters by 39% and the amount of computation by 49% compared with YOLOv4-tiny-31, while maintaining the same AP
For the cloud GPU model, the proposed model has 19% fewer parameters and 33% less computation than its baseline, but still has a higher AP

Comparison with state-of-the-arts

The proposed method has the best overall speed-accuracy trade-off
YOLOv7-tiny-SiLU is 127 fps faster and 10.7% more accurate in AP than YOLOv5-N (r6.1)
YOLOv7 reaches 51.4% AP at 161 fps, whereas PPYOLOE-L reaches the same AP at only 78 fps
YOLOv7 has 41% fewer parameters than PPYOLOE-L
YOLOv7-X (114 fps) is 3.9% more accurate in AP than YOLOv5-L (r6.1, 99 fps)
YOLOv7-X is 31 fps faster than YOLOv5-X (r6.1), a model of similar scale
YOLOv7-X has 22% fewer parameters and 8% less computation than YOLOv5-X
YOLOv7-W6 is 8 fps faster and 1% more accurate in AP than YOLOR-P6
YOLOv7-E6 is 0.9% more accurate in AP than YOLOv5-X6 (r6.1), with 45% fewer parameters and 63% less computation
YOLOv7-D6 is 0.8% more accurate in AP than YOLOR-E6 at a similar inference speed
YOLOv7-E6E is 0.3% more accurate in AP than YOLOR-D6 at a similar inference speed
In the ablation study, the compound scaling method improves AP by 0.5% with fewer parameters and less computation
RepConv improves AP in the concatenation-based model
The reversed dark block improves AP in the residual-based model
Lead guided label assignment improves AP, AP 50, and AP 75
The partial coarse-to-fine lead guided method has a better auxiliary effect

Conclusions

Proposed a new architecture of real-time object detector and the corresponding model scaling method
Found the replacement problem of the re-parameterized module and the allocation problem of dynamic label assignment
Proposed the trainable bag-of-freebies method to enhance the accuracy of object detection
Developed the YOLOv7 series of object detection systems with state-of-the-art results
YOLOv7 surpasses all known object detectors in speed and accuracy
YOLOv7-E6 outperforms transformer-based and convolutional-based detectors in speed and accuracy
Trained YOLOv7 only on the MS COCO dataset from scratch
The YOLOv7-E6E real-time model (56.8% AP) is +13.7% AP higher than the current most accurate meituan/YOLOv6-s model (43.1% AP) on the COCO dataset
YOLOv7-tiny (35.2% AP, 0.4 ms) is +25% faster and +0.2% AP higher than meituan/YOLOv6-n (35.0% AP, 0.5 ms)
The paper reports ablation studies on the proposed model scaling, the planned RepConcatenation model, the planned RepResidual model, the auxiliary head, the constrained auxiliary head, and the partial auxiliary head
The paper also compares baseline object detectors, state-of-the-art real-time object detectors, and different settings
