YOLOv4: Optimal Speed and Accuracy of Object Detection Paper Summary

(Alexey Bochkovskiy,Chien-Yao Wang, Hong-Yuan Mark Liao)

What did the authors want to achieve ?

fast (real time) object detection, that can be trained on a GPU with 8-16 GB of VRAM
a model that can be easily trained and used
add state of the art methods for object detection, building on YOLOv3
find a good model for both GPU and VPU implementation

images

New Data Augmentation techniques are used, for photometric and geometric variability : => Random Erase, Cutout to leave part of image of certain value => Dropout, Dropblock does the same with the net params
Mixup : mult image augmentation
Style Transfer GAN for texture stability
Dataset Bias :

=> focal loss, data imbalance between different classics => one-hot hard representation => soft labels BBox regression :
MSE has x and y independent, also Anchors => IoU loss => coverage and area are considered => scale invariant, not the case with traditional methods => DIoU and CIoU loss

only a small cost of compute during inference, improve accuracy :
- enlarging receptive field : SPP, ASPP, RFB, SPP originates from Spatial Pyramid Matching => extract bag of words features - SPP infeasible for FCN nets, as it outputs a 1D feature vector => YOLOv3 concat of max pooling outputs with kxk kernel size => larger receptive field of the backbone => 2.7% higher AP50, 0.5 more compute necessary
- ASPP diff. to SPP : max pool of 3x3, dilation of k
- RFB : several dilated kxk convs => 7% more AP, 5.7% more compute Attention module :
mainly channelwise and pointwise attention, SE and SAM modules, SAM with no extra cost on GPU => better for us
Feature Integration :
skip connections, hyper-column
channelwise weighting on multi-scale with FPN methods :
- SFAM, ASFF, BiFPN, … Activation Functions :
  - Mish, Swish, (fully differentiable) PReLu, … Post-processing :
  - NMS “messy” with occlusions => DIoU NMS distance : center to BBox screening process
  - NMS method not necessary in anchor free method

things to think about : a reference that is good for classification is not always good for object detection, due to the detector needing :
- higher input size (for small objects)
- more layers (higher receptive field to cover larger input)
- more parameters (for greater capacity, to detect different sized objects)

=> a detector needs a backbone with more 3x3 convs and more params

images

Due to that, CSPDarknet53 seems to be the best choice in theory

The following improvements are done :

SPP additional module for larger receptive field (almost no runtime disadvantage)
PANet path-aggregation neck as param aggregation method instead of FPN from YOLOv3
YOLOv3 anchor based head is used
DropBlock regularization method

For single GPU training :

synchBN is not considered : goal is to run on single GPU, thanks guys !!
new data augmentation mosaic (mixes 4 training images), Self Adversarial Training => detection of objects outside their context
SAT : Forward Backward Training : 1) adversarial attack is performed on input 2) neural net is trained to detect and object on this moded image in a normal way

Also :

optimal hyper-params while applying genetic algos
Cross mini Batch Normalization : mini-batch split within batch
SAM is modified from spatial-wise to pointwise attention, as can be seen below :

images

Techniques :

Bag of freebies for Backbone : CutMix and Mosaic data augmentation, DropBlock reg., class label smoothing
Bag of freebies for detector : CIoU-loss, CmBN, DropBlock reg., Mosaic data augmentation, SAT, Eliminate gird sensitivity, multi anchors for a single ground truth, cosine annealing, hyperparams, random training shapes
Bag of Specials for Backbone : Mish activation, Cross-stage partial connections (CSP), Multi- input weighted residual connections (MiWRC)
Bag of specials for detector : Mish act., SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS

A FPS /mAP (@Iou50) comparison to other detecors can be seen below :

images