DenseNet paper summary


What did they try to accomplish?

  • Improve CNNs: fight the vanishing gradient problem, improve regularization, and remove the redundancy (redundant layers/neurons) present in current CNNs (like ResNets), which in turn reduces the number of parameters

Key elements

Concatenation and dense connectivity

DenseNets concatenate feature maps instead of using the classic ResNet skip connection function.

In a standard feed-forward network, the output of the lth layer is the input of the (l+1)th layer: xl = Hl(xl-1). ResNets add an identity skip connection that sums feature maps: xl = Hl(xl-1) + xl-1. DenseNets replace this summation with concatenation.

  • DenseNets concatenate feature maps of the same size, so a network with L layers has L*(L+1)/2 connections instead of L. Consequently, every DenseNet layer has access to the feature maps of all preceding layers:

xl = Hl([x0, x1, ..., xl-1])

Each function Hl is a composite function of Batch Normalization, a ReLU activation, and a 3x3 convolution.
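Below is a minimal PyTorch sketch of this connectivity (my own illustration, not the authors' reference code): each layer applies the BN → ReLU → 3x3 conv composite function and receives the concatenation of all preceding feature maps.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One Hl: BatchNorm -> ReLU -> 3x3 conv, producing k (growth_rate) new feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(torch.relu(self.norm(x)))

class DenseBlock(nn.Module):
    """Every layer gets the concatenation of all preceding feature maps as input."""
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer(in_channels + i * growth_rate, growth_rate) for i in range(num_layers)]
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # xl = Hl([x0, ..., xl-1])
            features.append(out)
        return torch.cat(features, dim=1)
```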

Pooling / Transition layers

When the size of the feature maps changes, concatenation is not viable. The network is therefore divided into several dense blocks; between them, a "transition layer" consisting of batch normalization, a 1x1 convolution, and 2x2 average pooling is applied.
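A possible PyTorch sketch of such a transition layer, following the description above (the class is my own, not from the paper's code):

```python
import torch.nn as nn

class Transition(nn.Module):
    """Transition layer between dense blocks: BatchNorm -> 1x1 conv -> 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        # Halves the spatial resolution and (optionally) reduces the number of channels
        return self.pool(self.conv(self.norm(x)))
```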

Growth rate

The growth rate k is a hyperparameter that regulates how much each layer contributes to the global state. If each composite function Hl produces k feature maps, the lth layer receives k0 + k * (l-1) feature maps as input, where k0 is the number of channels in the input layer. Notably, DenseNets work well with narrow layers, e.g. k = 12.
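A tiny worked example of this formula (the values of k0 and k here are purely illustrative):

```python
def input_feature_maps(l, k0=24, k=12):
    """Number of feature maps seen by the l-th layer of a dense block: k0 + k * (l - 1)."""
    return k0 + k * (l - 1)

# With k0 = 24 input channels and growth rate k = 12:
print([input_feature_maps(l) for l in range(1, 7)])  # [24, 36, 48, 60, 72, 84]
```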

Bottleneck layers

To reduce the number of input channels (for computational efficiency), bottleneck layers apply a 1x1 convolution before each 3x3 convolution (the DenseNet-B variant).
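A sketch of such a bottleneck (DenseNet-B) layer in PyTorch; the 1x1 convolution produces 4k intermediate feature maps (the value used in the paper) before the 3x3 convolution brings the output back to k:

```python
import torch
import torch.nn as nn

class BottleneckLayer(nn.Module):
    """DenseNet-B layer: BN -> ReLU -> 1x1 conv (4k maps) -> BN -> ReLU -> 3x3 conv (k maps)."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        inter_channels = 4 * growth_rate  # the 1x1 conv produces 4k feature maps
        self.norm1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False)
        self.norm2 = nn.BatchNorm2d(inter_channels)
        self.conv2 = nn.Conv2d(inter_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.norm1(x)))
        return self.conv2(torch.relu(self.norm2(out)))
```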

Compression

Compression is used to reduce the number of feature maps at transition layers: if a dense block ends with m feature maps, the transition layer generates θ*m output feature maps, where 0 < θ <= 1, with θ = 0.5 used in the paper's experiments.
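Reusing the Transition sketch from above, compression simply sets the transition layer's output width (the channel count here is illustrative):

```python
theta = 0.5   # compression factor
m = 256       # feature maps at the end of a dense block (illustrative)
transition = Transition(in_channels=m, out_channels=int(theta * m))  # 256 -> 128 channels
```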

Implementation Details

  • Kaiming/He initialization is used
  • Zero padding (one pixel on each side) is used for the 3x3 convolutions in each dense block, keeping feature-map sizes fixed
  • Global average pooling is applied after the last dense block, followed by a softmax classifier
  • 3 dense blocks are used for all datasets except ImageNet (which uses 4)
  • Weight decay of 10^-4 (see the training sketch after this list)
  • Nesterov momentum of 0.9
  • The ImageNet version uses an initial 7x7 convolution with stride 2 instead of a 3x3
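These settings might translate into PyTorch roughly as follows (a sketch only; it reuses the DenseBlock from the earlier snippet as a stand-in model, and the learning-rate milestones correspond to the paper's 300-epoch CIFAR schedule):

```python
import torch
import torch.nn as nn

def init_weights(module):
    # Kaiming / He initialization for convolution weights
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight)

model = DenseBlock(num_layers=6, in_channels=24, growth_rate=12)  # stand-in model
model.apply(init_weights)

# SGD with Nesterov momentum 0.9 and weight decay 10^-4
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, nesterov=True, weight_decay=1e-4)
# Divide the learning rate by 10 at 50% and 75% of the 300 training epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 225], gamma=0.1)
```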

Results and Conclusion


  • The impact of bottleneck layers decreases with the depth of the network
  • DenseNets do not suffer from the same regularization/overfitting issues as ResNets
  • DenseNet-BC with 15.3 million parameters outperforms the much larger FractalNet (comparable to ResNet-1001), with DenseNet having roughly 90% fewer parameters
  • A DenseNet with as much computational complexity (FLOPs) as ResNet-50 performs on par with ResNet-101
  • A DenseNet with 0.8 million parameters performs as well as a ResNet with 10.2 million parameters
  • Deep supervision is achieved implicitly with a single classifier: the short connections give every layer direct supervision from the loss, so no auxiliary classifiers are needed (unlike Inception)
  • The intuition behind DenseNets' good performance: the architecture behaves like a ResNet trained with stochastic depth, meaning redundant layers are effectively dropped from the start, allowing smaller networks

References that are interesting to follow