YOLOv4, a summary
YOLO is back with new researchers : a summary of version 4
YOLOv4: Optimal Speed and Accuracy of Object Detection Paper Summary
(Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao)
What did the authors want to achieve ?
- fast (real-time) object detection that can be trained on a GPU with 8-16 GB of VRAM
- a model that can be easily trained and used
- add state-of-the-art methods for object detection, building on YOLOv3
- find a good model for both GPU and VPU implementation
Methods used
Bag of Freebies (methods that only increase training cost and leave inference cost unchanged)
- New data augmentation techniques are used for photometric and geometric variability :
- Random Erase and Cutout replace part of the image with a certain value (a Cutout sketch follows below) => Dropout and DropBlock do the same with the network parameters instead of the input
- Mixup : multi-image augmentation that blends two images and their labels
- Style Transfer GAN : used to reduce the texture bias the CNN would otherwise learn
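A minimal sketch of a Cutout-style erase on a NumPy image; the patch size and fill value here are illustrative choices, not the paper's settings:

```python
import numpy as np

def cutout(image: np.ndarray, patch_size: int = 32, fill: int = 0) -> np.ndarray:
    """Erase a random square patch so the net cannot rely on any single region."""
    h, w = image.shape[:2]
    cy, cx = np.random.randint(0, h), np.random.randint(0, w)
    y1, y2 = max(0, cy - patch_size // 2), min(h, cy + patch_size // 2)
    x1, x2 = max(0, cx - patch_size // 2), min(w, cx + patch_size // 2)
    out = image.copy()
    out[y1:y2, x1:x2] = fill
    return out
```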
- focal loss for the data imbalance between different classes
- label smoothing : the hard one-hot representation is replaced by soft labels (a sketch follows below)
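Label smoothing is essentially a one-liner; ε = 0.1 is a typical value, not necessarily the paper's:

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Soften hard 0/1 targets: the true class gets 1 - eps + eps/K, others eps/K."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k
```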
BBox regression :
- MSE treats x and y (and the anchor offsets) as independent values
- IoU loss considers coverage and area of the box as a whole => scale invariant, which is not the case with traditional methods
- DIoU and CIoU loss extend this further (DIoU is sketched below)
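A sketch of the DIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format; CIoU adds a further aspect-ratio consistency term on top of this:

```python
def diou_loss(b1, b2):
    """DIoU loss = 1 - IoU + (center distance)^2 / (enclosing box diagonal)^2."""
    # intersection and plain IoU
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    iou = inter / (a1 + a2 - inter + 1e-9)
    # squared distance between the two box centers
    rho2 = ((b1[0] + b1[2] - b2[0] - b2[2]) ** 2 +
            (b1[1] + b1[3] - b2[1] - b2[3]) ** 2) / 4.0
    # squared diagonal of the smallest box enclosing both
    diag2 = ((max(b1[2], b2[2]) - min(b1[0], b2[0])) ** 2 +
             (max(b1[3], b2[3]) - min(b1[1], b2[1])) ** 2 + 1e-9)
    return 1.0 - iou + rho2 / diag2
```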
Bag of Specials (methods that add only a small compute cost at inference but improve accuracy significantly) :
- enlarging the receptive field : SPP, ASPP, RFB (an SPP sketch follows this list)
- SPP originates from Spatial Pyramid Matching (extracting bag-of-words features); the original SPP is infeasible for fully convolutional nets, as it outputs a 1D feature vector
- the YOLOv3-style SPP instead concatenates max-pooling outputs with several k×k kernel sizes => larger receptive field of the backbone => 2.7% higher AP50 for only 0.5% more compute
- ASPP differs from SPP : the k×k max-pooling is replaced by 3×3 dilated convolutions with dilation rate k
- RFB : several dilated k×k convolutions => 5.7% higher AP50 at 7% more inference time
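A sketch of the YOLOv3/YOLOv4-style SPP block; the paper uses k = {1, 5, 9, 13} with stride 1 (k = 1 is just the identity, kept here as the raw input branch):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Concatenate the input with max-pooled copies of itself. Stride 1 and
    matching padding keep the spatial size, so only the channel count grows."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```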
Attention module :
- mainly channel-wise (SE) and point-wise (SAM) attention; SAM comes with no extra cost on GPU => better for us
Feature Integration :
- skip connections, hyper-columns
- channel-wise weighting of multi-scale features with FPN-like methods :
- SFAM, ASFF, BiFPN, …
Activation Functions :
- Mish and Swish (fully differentiable), PReLU, … (Mish is sketched below)
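Mish in code; it is smooth everywhere, unlike ReLU at zero:

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))."""
    return x * torch.tanh(F.softplus(x))
```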
Post-processing :
- plain NMS gets “messy” with occlusions => DIoU NMS adds the distance between box centers to the BBox screening process
- an NMS step is not necessary in anchor-free methods (a DIoU-NMS sketch is below)
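A sketch of DIoU-NMS: greedy NMS where the suppression criterion is IoU minus the normalized center distance, so heavily overlapping boxes with clearly distinct centers (typical for occlusions) can both survive; the 0.45 threshold is illustrative:

```python
def diou(b1, b2):
    """DIoU criterion: IoU minus normalized squared center distance."""
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    iou = inter / (a1 + a2 - inter + 1e-9)
    rho2 = ((b1[0] + b1[2] - b2[0] - b2[2]) ** 2 +
            (b1[1] + b1[3] - b2[1] - b2[3]) ** 2) / 4.0
    diag2 = ((max(b1[2], b2[2]) - min(b1[0], b2[0])) ** 2 +
             (max(b1[3], b2[3]) - min(b1[1], b2[1])) ** 2 + 1e-9)
    return iou - rho2 / diag2

def diou_nms(boxes, scores, threshold=0.45):
    """Keep the best-scoring box, drop others only if their DIoU with it is high."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if diou(boxes[best], boxes[i]) <= threshold]
    return keep
```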
Architecture Selection
- things to think about : a backbone that is good for classification is not always good for object detection, because the detector needs :
- higher input size (for small objects)
- more layers (higher receptive field to cover larger input)
- more parameters (for greater capacity, to detect multiple objects of different sizes)
=> a detector needs a backbone with more 3x3 convs and more params
Due to that, CSPDarknet53 seems to be the best choice in theory
The following improvements are done :
- SPP additional module for larger receptive field (almost no runtime disadvantage)
- PANet path-aggregation neck as param aggregation method instead of FPN from YOLOv3
- the anchor-based head of YOLOv3 is used
- DropBlock regularization method
For single GPU training :
- SyncBN is not considered : the goal is to train on a single GPU, thanks guys !!
- new data augmentation : Mosaic (mixes 4 training images, sketched below) and Self-Adversarial Training (SAT) => detection of objects outside their usual context
SAT works in two forward-backward stages : 1) the network performs an adversarial attack on the input image, altering the image instead of its weights, 2) the network is then trained to detect objects on this modified image in the normal way
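A minimal Mosaic sketch, assuming four equally sized NumPy images; the real implementation also rescales the tiles around a random crossover point and remaps the bounding boxes:

```python
import numpy as np

def mosaic(imgs):
    """Stitch 4 images into a 2x2 grid: objects end up in unusual contexts and
    each training sample mixes the statistics of 4 images."""
    assert len(imgs) == 4 and all(i.shape == imgs[0].shape for i in imgs)
    top = np.concatenate([imgs[0], imgs[1]], axis=1)
    bottom = np.concatenate([imgs[2], imgs[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)
```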
Also :
- optimal hyper-parameters are found by applying genetic algorithms
- Cross mini-Batch Normalization (CmBN) : BN statistics are collected only between mini-batches within a single batch
- SAM is modified from spatial-wise to point-wise attention, as sketched below :
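A sketch of the modified SAM as I read the paper's figure: the max-/average-pooling of the original SAM is dropped, and a convolution followed by a sigmoid yields a point-wise attention map that scales the input (the 1×1 kernel size is my assumption):

```python
import torch
import torch.nn as nn

class PointwiseSAM(nn.Module):
    """Point-wise attention: conv -> sigmoid -> element-wise rescaling of the input."""
    def __init__(self, channels: int):
        super().__init__()
        # kernel size is an assumption; the paper's figure does not specify it
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))
```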
Architecture Summary
- Backbone : CSPDarknet53
- Neck : SPP, PAN
- Head : YOLOv3
Techniques :
- Bag of Freebies for Backbone : CutMix and Mosaic data augmentation, DropBlock reg., class label smoothing
- Bag of Freebies for detector : CIoU-loss, CmBN, DropBlock reg., Mosaic data augmentation, SAT, eliminate grid sensitivity, multiple anchors for a single ground truth, cosine annealing scheduler, optimal hyper-parameters, random training shapes
- Bag of Specials for Backbone : Mish activation, Cross-Stage Partial connections (CSP), Multi-input Weighted Residual Connections (MiWRC)
- Bag of Specials for detector : Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS
Results
An FPS vs. mAP (@IoU50) comparison to other detectors is given in the paper : YOLOv4 runs about twice as fast as EfficientDet at comparable accuracy, and improves YOLOv3's AP and FPS by roughly 10% and 12%.