YOLOv4: Optimal Speed and Accuracy of Object Detection Paper Summary

(Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao)

What did the authors want to achieve ?

  • fast (real time) object detection, that can be trained on a GPU with 8-16 GB of VRAM
  • a model that can be easily trained and used
  • add state of the art methods for object detection, building on YOLOv3
  • find a good model for both GPU and VPU implementation


Methods used

Bag of Freebies (methods that only increase training runtime and not inference)

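  • Data augmentation techniques add photometric and geometric variability to the training images
  • Random Erase and CutOut fill a randomly chosen rectangle of the image with a random value or with zeros => DropOut and DropBlock apply the same idea inside the net, to the feature maps
  • MixUp : multi-image augmentation, two images are blended with different coefficient ratios and the labels are adjusted with the same ratios
  • Style Transfer GAN : reduces the texture bias learned by the CNN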

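    Dataset Bias :

  • focal loss addresses the data imbalance between different classes
  • a one-hot hard representation cannot express how strongly categories are related => label smoothing turns the hard labels into soft labels

    BBox regression :

  • MSE treats the coordinates of the box (or the offsets to the anchors) as independent variables and ignores the integrity of the object
  • IoU loss => the coverage area of the predicted and the ground truth BBox is considered => scale invariant, which is not the case with the traditional methods
  • DIoU loss additionally considers the distance between the box centers, CIoU loss considers overlap area, center distance and aspect ratio together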
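To make the difference between the losses above concrete, the following is a small sketch of the CIoU loss in plain Python. It follows the published CIoU definition (1 - IoU, plus the normalized center distance, plus an aspect ratio term). The `(x1, y1, x2, y2)` box format and the function names are assumptions for this example, and boxes are assumed to have positive width and height.

```python
import math


def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def ciou_loss(pred, target):
    """CIoU loss = 1 - IoU + center distance term + aspect ratio term."""
    overlap = iou(pred, target)

    # Squared distance between the two box centers.
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    tcx, tcy = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    center_dist_sq = (pcx - tcx) ** 2 + (pcy - tcy) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    ew = max(pred[2], target[2]) - min(pred[0], target[0])
    eh = max(pred[3], target[3]) - min(pred[1], target[1])
    diag_sq = ew ** 2 + eh ** 2

    # Aspect ratio consistency term v and its trade-off weight alpha.
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    tw, th = target[2] - target[0], target[3] - target[1]
    v = (4 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - overlap) + v) if v > 0 else 0.0

    return 1 - overlap + center_dist_sq / diag_sq + alpha * v
```

Unlike a plain IoU loss, which is 1 for every pair of non-overlapping boxes, this loss still grows with the distance between the boxes, so it keeps giving a useful signal when the prediction misses the target completely.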

Bag of specials (methods that only have a small impact on inference speed, but improve accuracy significantly)

    Enlarging the receptive field :

  • common modules : SPP, ASPP, RFB
  • SPP originates from Spatial Pyramid Matching (SPM), which extracts bag-of-words features ; the original SPP outputs a 1D feature vector, which makes it infeasible for FCN nets
  • YOLOv3 instead uses a concat of max pooling outputs with kxk kernel size and stride 1 => larger receptive field of the backbone => 2.7% higher AP50 for 0.5% more compute
  • ASPP diff. to SPP : the kxk max pooling is replaced by several 3x3 dilated convs with a dilation of k
  • RFB : several dilated kxk convs => 5.7% more AP50 (on SSD) for 7% more inference time

    Attention module :

  • mainly channelwise (SE) and pointwise (SAM) attention ; SE costs about 10% more inference time on GPU, SAM comes with no extra inference cost on GPU => SAM is the better choice
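The YOLOv3-style SPP block from the receptive field notes above can be sketched in plain Python as follows. Feature maps are nested lists, the kernel sizes (1, 5, 9, 13) are the ones reported in the paper, and the function names as well as the border handling (clipping the pooling window, which for max pooling is equivalent to padding with minus infinity) are assumptions made for this example.

```python
def max_pool_same(feature, k):
    """k x k max pooling with stride 1 that keeps the spatial size of `feature`."""
    height, width = len(feature), len(feature[0])
    r = k // 2
    return [[max(feature[yy][xx]
                 for yy in range(max(0, y - r), min(height, y + r + 1))
                 for xx in range(max(0, x - r), min(width, x + r + 1)))
             for x in range(width)]
            for y in range(height)]


def spp_block(channels, kernel_sizes=(1, 5, 9, 13)):
    """Pool every channel with every kernel size and concat along the channel axis."""
    out = []
    for k in kernel_sizes:
        out.extend(max_pool_same(channel, k) for channel in channels)
    return out
```

Because the stride is 1, the output keeps the input resolution and only the number of channels grows (here by a factor of 4), which is what lets the block sit inside a fully convolutional net.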
    Feature Integration :

  • skip connections, hyper-column
  • channelwise weighting on multi-scale with FPN methods : SFAM, ASFF, BiFPN, …

    Activation Functions :

  • Mish and Swish (both continuously differentiable), PReLU, …

    Post-processing :

  • NMS struggles with occlusions => DIoU NMS adds the center point distance to the BBox screening process
  • NMS is not necessary in anchor-free methods
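As an illustration of the post-processing step, here is a minimal sketch of greedy NMS with the DIoU criterion in plain Python. A box is suppressed only if its IoU with the selected box, minus the normalized center distance, reaches the threshold. The box format `(x1, y1, x2, y2)`, the function names and the default threshold are assumptions for this example.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def center_distance_term(box_a, box_b):
    """Squared center distance divided by the squared diagonal of the enclosing box."""
    dx = (box_a[0] + box_a[2]) / 2 - (box_b[0] + box_b[2]) / 2
    dy = (box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2
    ew = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    eh = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    return (dx ** 2 + dy ** 2) / (ew ** 2 + eh ** 2)


def diou_nms(boxes, scores, threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Keep a candidate only if its DIoU with the selected box is below the threshold.
        order = [i for i in order
                 if iou(boxes[best], boxes[i])
                 - center_distance_term(boxes[best], boxes[i]) < threshold]
    return keep
```

Two boxes with a moderate overlap but clearly separated centers are more likely to survive than with plain IoU-based NMS, which is the intended behaviour for occluded objects.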

Architecture Selection

  • things to think about : a reference model that is optimal for classification is not always optimal for object detection, because the detector needs :
    • higher input size (for small objects)
    • more layers (higher receptive field to cover larger input)
    • more parameters (for greater capacity, to detect different sized objects)

=> a detector needs a backbone with more 3x3 convs and more params
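The link between the number of conv layers and the receptive field can be checked with a few lines of arithmetic. The sketch below uses the standard recurrence for a plain chain of convolutions, where each layer adds (kernel - 1) times the product of the strides before it. The function name and the `(kernel_size, stride)` layer description are assumptions for this example, and it ignores branches, dilation and pooling.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a chain of (kernel_size, stride) conv layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input-space steps
        jump *= stride             # distance in input pixels between neighbouring outputs
    return rf
```

For example, `receptive_field([(3, 1)] * 3)` gives 7, while a stride-2 layer placed first makes every following 3x3 conv count double.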


Based on these criteria, CSPDarknet53 seems to be the best choice in theory

The following additions are made :

  • SPP as an additional module for a larger receptive field (almost no runtime disadvantage)
  • PANet path-aggregation neck for parameter aggregation from different backbone levels, instead of the FPN from YOLOv3
  • the anchor based YOLOv3 head is used
  • DropBlock as the regularization method

For single GPU training :

  • SyncBN is not considered : goal is to run on single GPU, thanks guys !!
  • new data augmentation Mosaic (mixes 4 training images) => detection of objects outside their normal context
  • Self Adversarial Training (SAT), in two forward backward stages : 1) the net performs an adversarial attack on the input image instead of updating its weights 2) the net is trained to detect an object on this modified image in the normal way
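The Mosaic idea from the list above can be sketched in plain Python as follows. Four images are combined on one canvas around a random split point, so each quadrant shows a crop of a different image. This is a simplified illustration: the function name and the nested-list image format are assumptions, and a real implementation would also rescale the images and remap and clip their bounding boxes.

```python
import random


def mosaic(images, size, rng=random):
    """Combine four size x size images (size >= 2) on one canvas around a random split point."""
    if len(images) != 4:
        raise ValueError("mosaic needs exactly four images")
    cx = rng.randrange(1, size)  # column where the left and right images meet
    cy = rng.randrange(1, size)  # row where the top and bottom images meet
    canvas = [[0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            # 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right image
            index = (0 if y < cy else 2) + (0 if x < cx else 1)
            canvas[y][x] = images[index][y][x]
    return canvas, (cx, cy)
```

Since every training sample now contains parts of four images, batch normalization sees statistics from four images per sample, which the paper cites as a reason why a large mini-batch becomes less important.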

Also :

  • optimal hyper-params are selected with genetic algos
  • Cross mini Batch Normalization (CmBN) : statistics are collected only across the mini-batches within a single batch
  • SAM is modified from spatial-wise to pointwise attention, as can be seen below :

[Figure : SAM and modified SAM]
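A minimal sketch of the modified SAM in plain Python is given below. The original SAM pools over the channels before the convolution and produces one attention value per spatial position. The modified version skips the pooling, so the sigmoid gate has the same shape as the feature map and weights every point individually. Using a 1x1 convolution for the gate, the nested-list layout `C x H x W` and the function names are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def modified_sam(features, weights):
    """Pointwise attention: features * sigmoid(conv1x1(features)).

    features: C x H x W nested lists, weights: C x C matrix of the (assumed) 1x1 conv.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            pixel = [features[c][y][x] for c in range(channels)]
            for c in range(channels):
                # One gate value per channel and per position, not one per position.
                gate = sigmoid(sum(w * v for w, v in zip(weights[c], pixel)))
                out[c][y][x] = features[c][y][x] * gate
    return out
```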

Architecture Summary

  • Backbone : CSPDarknet53
  • Neck : SPP, PAN
  • Head : YOLOv3

Techniques :

  • Bag of Freebies for backbone : CutMix and Mosaic data augmentation, DropBlock reg., class label smoothing
  • Bag of Freebies for detector : CIoU-loss, CmBN, DropBlock reg., Mosaic data augmentation, SAT, eliminate grid sensitivity, multi anchors for a single ground truth, cosine annealing, optimal hyper-params, random training shapes
  • Bag of Specials for backbone : Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)
  • Bag of Specials for detector : Mish act., SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS
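Three of the simpler techniques in the lists above fit in a few lines each. The sketch below shows the Mish activation, a cosine annealing learning rate and class label smoothing in plain Python. The function names, the overflow guard in `mish` and the default `epsilon=0.1` are assumptions for this example, not settings taken from the paper.

```python
import math


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    if x > 20.0:
        return x  # tanh(softplus(x)) is 1 to float precision here; also avoids exp overflow
    return x * math.tanh(math.log1p(math.exp(x)))


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along half a cosine wave."""
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)


def smooth_labels(one_hot, epsilon=0.1):
    """Move `epsilon` of the probability mass from the hard label to a uniform distribution."""
    num_classes = len(one_hot)
    return [(1 - epsilon) * value + epsilon / num_classes for value in one_hot]
```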

Results

An FPS / mAP (@IoU50) comparison to other detectors can be seen below :

[Figure : FPS / mAP (@IoU50) comparison of YOLOv4 and other detectors]