DenseNet
A summary of the DenseNet paper, winner of the 2017 CVPR best paper award.
DenseNet paper summary
What did they try to accomplish?
- Improve CNNs: fight the vanishing-gradient problem, improve regularization, and remove the redundancy (redundant layers/neurons) present in existing CNNs (such as ResNets), which in turn reduces the number of parameters
Key elements
Concatenation and dense connectivity
DenseNets concatenate feature maps instead of summing them as in the classic ResNet skip connection. In a traditional feed-forward network, the output of the l-th layer is simply the input to the (l+1)-th layer, i.e. x_l = H_l(x_{l-1}); ResNets add an identity skip connection on top: x_l = H_l(x_{l-1}) + x_{l-1}.
- DenseNets concatenate feature maps of the same size, which means a network with L layers has L * (L + 1) / 2 connections instead of the L connections of a traditional network. Consequently, every DenseNet layer has access to the feature maps of all preceding layers: x_l = H_l([x_0, x_1, ..., x_{l-1}])
The function H_l is a composite function of Batch Norm, a ReLU activation, and a 3x3 convolution.
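To make the connectivity pattern concrete, here is a minimal PyTorch sketch of H_l and of a dense block that concatenates all preceding feature maps. This is an illustration of the basic BN-ReLU-Conv(3x3) variant, not the authors' reference implementation; class names and channel counts are my own.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Composite function H_l: BN -> ReLU -> 3x3 Conv (basic DenseNet variant)."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        # x is the concatenation of all preceding feature maps.
        return self.conv(self.relu(self.norm(x)))

class DenseBlock(nn.Module):
    """Each layer receives the feature maps of every preceding layer."""
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList([
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Dense connectivity: x_l = H_l([x_0, x_1, ..., x_{l-1}])
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

# Example: 4 layers, 24 input channels, growth rate k = 12
block = DenseBlock(num_layers=4, in_channels=24, growth_rate=12)
out = block(torch.randn(1, 24, 32, 32))
print(out.shape)  # torch.Size([1, 72, 32, 32]) -> 24 + 4 * 12 channels
```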
Pooling / Transition Layers
When the size of the feature maps changes, concatenation is no longer viable. The network is therefore divided into several dense blocks; between them, a “transition layer” consisting of batch norm, a 1x1 convolution, and 2x2 average pooling downsamples the feature maps.
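A minimal sketch of such a transition layer, again assuming PyTorch; the channel counts in the usage example are illustrative, and many implementations also insert a ReLU after the batch norm.

```python
import torch
import torch.nn as nn

class Transition(nn.Module):
    """Transition layer between dense blocks: BN -> 1x1 Conv -> 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        # Halves the spatial resolution so the next dense block
        # can concatenate feature maps of a common size again.
        return self.pool(self.conv(self.norm(x)))

trans = Transition(in_channels=72, out_channels=36)
print(trans(torch.randn(1, 72, 32, 32)).shape)  # torch.Size([1, 36, 16, 16])
```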
Growth rate
The growth rate k is a hyperparameter that regulates how much each layer contributes to the global state. If each composite function H_l produces k feature maps, the l-th layer has k_0 + k * (l - 1) input feature maps, where k_0 is the number of channels in the input layer. It has to be noted that DenseNets use narrow layers, with k = 12.
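As a quick sanity check of the formula, here is a tiny script assuming k_0 = 24 input channels (an illustrative value) and the paper's k = 12:

```python
# Input feature maps seen by the l-th layer of a dense block: k0 + k * (l - 1)
k0, k = 24, 12  # k0 is illustrative, k = 12 as in the narrow DenseNet layers
for l in range(1, 7):
    print(f"layer {l}: {k0 + k * (l - 1)} input feature maps")
# layer 1: 24, layer 2: 36, layer 3: 48, ..., layer 6: 84
```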
Bottleneck layers
To reduce the number of input channels (for computational efficiency), bottleneck layers introduce a 1x1 convolution before each 3x3 convolution; in the paper this 1x1 convolution produces 4k feature maps (the DenseNet-B variant).
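A sketch of this bottleneck variant of H_l, assuming PyTorch and the paper's choice of 4k feature maps for the 1x1 convolution; the class name and the channel counts in the usage example are mine.

```python
import torch
import torch.nn as nn

class BottleneckDenseLayer(nn.Module):
    """H_l with a bottleneck: BN -> ReLU -> 1x1 Conv -> BN -> ReLU -> 3x3 Conv."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        inter_channels = 4 * growth_rate  # the 1x1 conv outputs 4k feature maps
        self.bottleneck = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False),
        )
        self.conv3x3 = nn.Sequential(
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # The 1x1 conv shrinks the (possibly very wide) concatenated input
        # before the more expensive 3x3 convolution.
        return self.conv3x3(self.bottleneck(x))

layer = BottleneckDenseLayer(in_channels=96, growth_rate=12)
print(layer(torch.randn(1, 96, 32, 32)).shape)  # torch.Size([1, 12, 32, 32])
```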
Compression
Compression is used to reduce the number of feature maps at transition layers: if a dense block outputs m feature maps, the transition layer generates a * m feature maps, where 0 < a <= 1, with a = 0.5 in most cases.
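A tiny numeric example of the compression rule (the value of m is illustrative):

```python
# Feature maps produced by a transition layer for compression factor a
m = 256  # feature maps coming out of a dense block (illustrative)
a = 0.5  # compression factor, 0 < a <= 1
print(int(a * m))  # 128 -> the next dense block starts from half as many maps
```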
Implementation Details
- Kaiming/He init. is used
- Zero padding is used at each dense block, so the 3x3 convolutions keep the feature-map size fixed
- Global average pooling is applied after the last dense block, followed by a softmax classifier
- 3 dense blocks are used for all datasets except ImageNet (which uses 4)
- Weight decay of 10^-4 (see the training sketch after this list)
- Nesterov momentum of 0.9
- The ImageNet implementation uses a 7x7 (stride 2) initial convolution instead of a 3x3 one
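Put together, the initialization and optimizer settings translate roughly to the following PyTorch sketch; the `model` here is only a placeholder, and the initial learning rate of 0.1 is the one reported in the paper.

```python
import torch
import torch.nn as nn

def init_weights(module):
    # Kaiming / He initialization for the convolution weights.
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight)

model = nn.Sequential(  # placeholder; a real DenseNet would go here
    nn.Conv2d(3, 24, kernel_size=3, padding=1, bias=False),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(24, 10),
)
model.apply(init_weights)

# SGD with Nesterov momentum 0.9 and weight decay 1e-4, as listed above.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, nesterov=True, weight_decay=1e-4
)
```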
Results and Conclusion
- The impact of the bottleneck layers decreases with the depth of the network
- DenseNets do not show the same regularization issues as ResNets
- DenseNet-BC with 15.3 million parameters outperforms the much larger FractalNet, and reaches performance comparable to ResNet-1001 with 90% fewer parameters
- A DenseNet with the same computational cost (FLOPs) as a ResNet-50 performs on par with a ResNet-101
- A DenseNet with 0.8 million parameters performs as well as a ResNet with 10.2 million parameters
- Deep supervision is achieved implicitly with a single classifier: every layer receives additional supervision through the shorter connections to the loss, so no auxiliary classifiers are needed (unlike Inception)
- The intuition behind the good performance of DenseNets: the architecture behaves like a ResNet trained with stochastic depth, where redundant layers are effectively dropped from the start, which allows smaller networks