YOLOv4: Optimal Speed and Accuracy of Object Detection
The majority of CNN-based object detectors are largely applicable only to recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the accuracy of real-time object detectors enables using them not only for hint-generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPUs) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with a large mini-batch size. We address such problems by creating a CNN that operates in real time on a conventional GPU, and for which training requires only one conventional GPU.
Figure 1: Comparison of the proposed YOLOv4 and other state-of-the-art object detectors. YOLOv4 runs twice as fast as EfficientDet with comparable performance, and improves YOLOv3's AP and FPS by 10% and 12%, respectively.
The main goal of this work is to design an object detector with a fast operating speed in production systems, optimized for parallel computation, rather than for a low theoretical computation volume.
- Related work
- Object Detection Models
A modern detector is usually composed of two parts: a backbone, which is pre-trained on ImageNet, and a head, which is used to predict classes and bounding boxes of objects. For detectors running on a GPU platform, the backbone could be VGG, ResNet, ResNeXt, or DenseNet. For detectors running on a CPU platform, the backbone could be SqueezeNet, MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53].
Parts of an Object Detector
- Input: Image, Patches, Image Pyramid
- Backbones: VGG16, ResNet-50, SpineNet, EfficientNet-B0/B7, CSPResNeXt50, CSPDarknet53
- Additional blocks: SPP, ASPP, RFB, SAM
- Path-aggregation blocks: FPN, PAN, NAS-FPN, Fully-connected FPN, BiFPN, ASFF, SFAM
- Dense Prediction (one-stage): RPN, SSD, YOLO, RetinaNet (anchor-based); CornerNet, CenterNet, MatrixNet, FCOS (anchor-free)
- Sparse Prediction (two-stage): Faster R-CNN, R-FCN, Mask R-CNN (anchor-based); RepPoints (anchor-free)
Comparison of the proposed YOLOv4 and other state-of-the-art object detectors. The dashed line shows the latency of model inference only, while the solid line includes both model inference and post-processing.
- Bag of Freebies
Usually, a conventional object detector is trained offline. Therefore, researchers like to take advantage of this and develop better training methods that allow the object detector to achieve better accuracy without increasing the inference cost. We call these methods, which only change the training strategy or only increase the training cost, a "bag of freebies."

The purpose of data augmentation is to increase the variability of the input images so that the designed object detection model has higher robustness to images obtained from different environments.
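As a minimal illustration of such "freebie" augmentations, the sketch below applies a horizontal flip (with matching bounding-box adjustment) and a brightness jitter to an image represented as a list of pixel rows. The function names and the (x1, y1, x2, y2) box format are illustrative assumptions, not the paper's API.

```python
import random

def horizontal_flip(image, boxes):
    """Flip an image (list of pixel rows) and its boxes left-right.

    Boxes use a hypothetical (x_min, y_min, x_max, y_max) pixel format.
    """
    width = len(image[0])
    flipped = [list(reversed(row)) for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2)
                 for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes

def brightness_jitter(image, max_delta=30, seed=None):
    """Add a random brightness offset, clamped to the [0, 255] range."""
    rng = random.Random(seed)
    delta = rng.uniform(-max_delta, max_delta)
    return [[max(0, min(255, p + delta)) for p in row] for row in image]
```

At training time such transforms are applied on the fly, so the model sees a different variant of each image every epoch at zero inference cost.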
In dealing with the problem of semantic distribution bias, a very important issue is data imbalance between different classes. In two-stage object detectors this problem is often solved by hard negative example mining or online hard example mining (OHEM).

Example mining, however, does not apply to one-stage object detectors, because they belong to the dense-prediction architecture. Therefore, Lin et al. proposed focal loss to deal with the data imbalance between classes. Another very important issue is that the one-hot hard representation cannot express the degree of association between different categories.
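Focal loss down-weights the contribution of well-classified examples so that the many easy background predictions of a dense detector do not dominate training. A minimal sketch for a single binary prediction, using the commonly cited defaults (gamma = 2, alpha = 0.25; the exact settings are a modeling choice):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction (after Lin et al.).

    p: predicted probability of the positive class; y: label in {0, 1}.
    The (1 - p_t)**gamma factor shrinks the loss of easy examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp p_t away from 0 to keep the log numerically safe.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
```

With gamma = 0 this reduces to alpha-weighted cross-entropy; increasing gamma focuses training on hard, misclassified examples.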
To handle bounding-box regression better, some researchers recently proposed IoU loss, which takes the coverage between the predicted BBox area and the ground-truth BBox area into consideration. IoU loss is computed from the four coordinate points of the predicted BBox by evaluating IoU with the ground truth, treating the box as a whole unit. Because IoU is a scale-invariant representation, the loss does not grow with object scale the way traditional coordinate-wise losses do. Recently, some researchers have continued to improve IoU loss.
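A minimal sketch of IoU loss in its simplest form, L = 1 - IoU, over axis-aligned (x1, y1, x2, y2) boxes (the coordinate convention here is an assumption for illustration; later variants such as GIoU/DIoU/CIoU add extra penalty terms):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred, gt):
    """Basic IoU loss: 1 - IoU. Scale-invariant, unlike l1/l2 on coordinates."""
    return 1.0 - iou(pred, gt)
```

Scaling both boxes by the same factor leaves the loss unchanged, which is exactly the scale-invariance property the text refers to.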
- Bag of Specials
For those plugin modules and post-processing methods that increase the inference cost only by a small amount but can significantly improve the accuracy of object detection, we use the term "bag of specials". Generally speaking, these plugin modules enhance certain attributes of a model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening feature-integration capability, while post-processing is a method for screening model prediction results.
Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramids have been proposed. The modules of this sort include SFAM, ASFF, and BiFPN. The main idea of SFAM is to use the SE module to execute channel-wise level re-weighting on multi-scale concatenated feature maps. As for ASFF, it uses softmax as point-wise level re-weighting and then adds feature maps of different scales. In BiFPN, multi-input weighted residual connections are proposed to execute scale-wise level re-weighting, and then add feature maps of different scales.
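The common thread of these modules is a learned weighted sum over feature maps from different pyramid levels. The sketch below shows BiFPN-style "fast normalized fusion" on scalar stand-ins for already-resized feature values; the function name and epsilon value are assumptions for illustration, not the published implementation.

```python
def weighted_fusion(features, weights, eps=1e-4):
    """BiFPN-style fast normalized fusion.

    features: values from different pyramid levels (already resized to the
    same shape); weights: learnable non-negative scalars.
    out = sum(w_i * f_i) / (sum(w_i) + eps)
    """
    total = sum(weights) + eps
    return sum(w * f for w, f in zip(features, weights)) / total
```

ASFF differs mainly in normalizing the weights with a softmax per spatial location, and SFAM in deriving channel weights from an SE block, but all three reduce to re-weighting before summation.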
Example of the representation chosen when predicting bounding box position and shape, taken from YOLO9000: Better, Faster, Stronger.
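The YOLO9000 representation predicts offsets (tx, ty, tw, th) relative to a grid cell and an anchor prior rather than absolute coordinates. A minimal decoding sketch of those published equations (the function name is an assumption):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode YOLOv2/v3-style box predictions.

    (cx, cy): top-left corner of the responsible grid cell;
    (pw, ph): width/height of the anchor prior.
    bx = sigmoid(tx) + cx    by = sigmoid(ty) + cy
    bw = pw * exp(tw)        bh = ph * exp(th)
    """
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

The sigmoid bounds the predicted center inside its grid cell, which stabilizes training compared with unconstrained offsets.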
The post-processing method commonly used in deep-learning-based object detection is NMS, which can be used to filter BBoxes that badly predict the same object and to retain only the candidate BBoxes with higher response. The idea of the DIoU NMS developers is to add center-point distance information to the BBox screening process, on the basis of soft-NMS. It is worth mentioning that, since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
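For reference, classic greedy NMS can be sketched as follows: keep the highest-scoring box, suppress all boxes that overlap it above a threshold, and repeat. The box format and threshold default here are illustrative assumptions; DIoU NMS would additionally fold the distance between box centers into the suppression criterion.

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes; returns kept indices."""

    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # Visit boxes in order of decreasing confidence score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress remaining boxes that overlap the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Soft-NMS replaces the hard suppression in the last step with a score decay, and DIoU NMS makes the criterion also depend on center-point distance.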
- It’s incredibly fast, processing 45 to 150 frames per second.
- YOLO also learns generalizable object representations.
- The network generalizes well from natural images to other domains.
- Comparatively low recall and more localization error than Faster R-CNN.
- Struggles to detect close objects, because each grid cell can propose only two bounding boxes.
- Struggles to detect small objects.
We offer a state-of-the-art detector that is faster (FPS) and more accurate (MS COCO AP50:95 and AP50) than all available alternative detectors. The described detector can be trained and used on a conventional GPU with 8–16 GB of VRAM, which makes its broad use possible. The original concept of one-stage anchor-based detectors has proven its viability.