DenseNet
A summary of the DenseNet paper, winner of the 2017 CVPR best paper award.
DenseNet paper summary
What did they try to accomplish?
- Improve CNNs: fight the vanishing-gradient problem, improve regularization, and remove the redundancy (redundant layers/neurons) present in existing CNNs (such as ResNets), which in turn reduces the number of parameters
Key elements
Concatenation and dense connectivity
DenseNets concatenate feature maps instead of summing them as in the classic ResNet skip connection. In a traditional feed-forward network, the output of the l-th layer is simply the input to the (l+1)-th layer, i.e. x_l = H_l(x_{l-1}); ResNets add an identity skip connection on top: x_l = H_l(x_{l-1}) + x_{l-1}.
- DenseNets concatenate feature maps of the same size, which means a network with L layers has L * (L + 1) / 2 connections instead of the L connections of a traditional network. Consequently, every DenseNet layer has access to the feature maps of all preceding layers: x_l = H_l([x_0, x_1, ..., x_{l-1}])
The function H_l is a composite function of Batch Norm, a ReLU activation, and a 3x3 convolution.
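To make the connectivity pattern concrete, here is a minimal PyTorch sketch of H_l and of a dense block that concatenates all preceding feature maps. This is an illustration of the basic BN-ReLU-Conv(3x3) variant, not the authors' reference implementation; class names and channel counts are my own.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Composite function H_l: BN -> ReLU -> 3x3 Conv (basic DenseNet variant)."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        # x is the concatenation of all preceding feature maps.
        return self.conv(self.relu(self.norm(x)))

class DenseBlock(nn.Module):
    """Each layer receives the feature maps of every preceding layer."""
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList([
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Dense connectivity: x_l = H_l([x_0, x_1, ..., x_{l-1}])
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

# Example: 4 layers, 24 input channels, growth rate k = 12
block = DenseBlock(num_layers=4, in_channels=24, growth_rate=12)
out = block(torch.randn(1, 24, 32, 32))
print(out.shape)  # torch.Size([1, 72, 32, 32]) -> 24 + 4 * 12 channels
```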
Pooling / Transition Layers
When the size of the feature maps changes, concatenation is no longer viable. The network is therefore divided into several dense blocks; between them, a “transition layer” consisting of batch norm, a 1x1 convolution, and 2x2 average pooling downsamples the feature maps.
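A minimal sketch of such a transition layer, again assuming PyTorch; the channel counts in the usage example are illustrative, and many implementations also insert a ReLU after the batch norm.

```python
import torch
import torch.nn as nn

class Transition(nn.Module):
    """Transition layer between dense blocks: BN -> 1x1 Conv -> 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        # Halves the spatial resolution so the next dense block
        # can concatenate feature maps of a common size again.
        return self.pool(self.conv(self.norm(x)))

trans = Transition(in_channels=72, out_channels=36)
print(trans(torch.randn(1, 72, 32, 32)).shape)  # torch.Size([1, 36, 16, 16])
```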
Growth rate
The growth rate k is a hyperparameter that regulates how much each layer contributes to the global state. If each composite function H_l produces k feature maps, the l-th layer has k_0 + k * (l - 1) input feature maps, where k_0 is the number of channels in the input layer. It has to be noted that DenseNets use narrow layers, with k = 12.
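As a quick sanity check of the formula, here is a tiny script assuming k_0 = 24 input channels (an illustrative value) and the paper's k = 12:

```python
# Input feature maps seen by the l-th layer of a dense block: k0 + k * (l - 1)
k0, k = 24, 12  # k0 is illustrative, k = 12 as in the narrow DenseNet layers
for l in range(1, 7):
    print(f"layer {l}: {k0 + k * (l - 1)} input feature maps")
# layer 1: 24, layer 2: 36, layer 3: 48, ..., layer 6: 84
```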
Bottleneck layers
To reduce the number of input channels (for computational efficiency), bottleneck layers introduce a 1x1 convolution before each 3x3 convolution; in the paper this 1x1 convolution produces 4k feature maps (the DenseNet-B variant).
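A sketch of this bottleneck variant of H_l, assuming PyTorch and the paper's choice of 4k feature maps for the 1x1 convolution; the class name and the channel counts in the usage example are mine.

```python
import torch
import torch.nn as nn

class BottleneckDenseLayer(nn.Module):
    """H_l with a bottleneck: BN -> ReLU -> 1x1 Conv -> BN -> ReLU -> 3x3 Conv."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        inter_channels = 4 * growth_rate  # the 1x1 conv outputs 4k feature maps
        self.bottleneck = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False),
        )
        self.conv3x3 = nn.Sequential(
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # The 1x1 conv shrinks the (possibly very wide) concatenated input
        # before the more expensive 3x3 convolution.
        return self.conv3x3(self.bottleneck(x))

layer = BottleneckDenseLayer(in_channels=96, growth_rate=12)
print(layer(torch.randn(1, 96, 32, 32)).shape)  # torch.Size([1, 12, 32, 32])
```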
Compression
Compression is used to reduce the number of feature maps at transition layers: if a dense block outputs m feature maps, the transition layer generates a * m feature maps, where 0 < a <= 1, with a = 0.5 in most cases.
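A tiny numeric example of the compression rule (the value of m is illustrative):

```python
# Feature maps produced by a transition layer for compression factor a
m = 256  # feature maps coming out of a dense block (illustrative)
a = 0.5  # compression factor, 0 < a <= 1
print(int(a * m))  # 128 -> the next dense block starts from half as many maps
```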
Implementation Details
- Kaiming/He init. is used
- Zero padding is used at each dense block, so the 3x3 convolutions keep the feature-map size fixed
- Global average pooling is applied after the last dense block, followed by a softmax classifier
- 3 dense blocks are used for all datasets except ImageNet (which uses 4)
- Weight decay of 10^-4 (see the training sketch after this list)
- Nesterov momentum of 0.9
- The ImageNet implementation uses a 7x7 (stride 2) initial convolution instead of a 3x3 one
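Put together, the initialization and optimizer settings translate roughly to the following PyTorch sketch; the `model` here is only a placeholder, and the initial learning rate of 0.1 is the one reported in the paper.

```python
import torch
import torch.nn as nn

def init_weights(module):
    # Kaiming / He initialization for the convolution weights.
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight)

model = nn.Sequential(  # placeholder; a real DenseNet would go here
    nn.Conv2d(3, 24, kernel_size=3, padding=1, bias=False),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(24, 10),
)
model.apply(init_weights)

# SGD with Nesterov momentum 0.9 and weight decay 1e-4, as listed above.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, nesterov=True, weight_decay=1e-4
)
```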
Results and Conclusion
- The impact of the bottleneck layers decreases with the depth of the network
- DenseNets do not show the same regularization issues as ResNets
- DenseNet-BC with 15.3 million parameters outperforms the much larger FractalNet, and reaches performance comparable to ResNet-1001 with 90% fewer parameters
- A DenseNet with the same computational cost (FLOPs) as a ResNet-50 performs on par with a ResNet-101
- A DenseNet with 0.8 million parameters performs as well as a ResNet with 10.2 million parameters
- Deep supervision is achieved implicitly with a single classifier: every layer receives additional supervision through the shorter connections to the loss, so no auxiliary classifiers are needed (unlike Inception)
- The intuition behind the good performance of DenseNets: the architecture behaves like a ResNet trained with stochastic depth, where redundant layers are effectively dropped from the start, which allows smaller networks