PANet Paper Summary
A summary of PANet, the Path Aggregation Network for Instance Segmentation
What did the authors want to achieve?
- improve the computer vision tasks of object detection and, mainly, instance segmentation
- build on top of Fast/Faster/Mask R-CNN and improve information propagation
- design an architecture that can deal with the blur and heavy occlusion in the datasets of the time (2018), like COCO 2017
- use the in-network feature hierarchy: the top-down path with lateral connections is augmented to emphasize semantically strong features
Findings
- Mask R-CNN: the path from low-level features to the topmost layers is long, which makes it difficult to access accurate localization information
- Mask R-CNN: each proposal is predicted from only a single feature view; multiple views are preferred to gather more diverse information
- Mask R-CNN: predictions are based on pooled feature grids that are assigned to levels heuristically; this can be improved, since low-level information can be important for the final prediction (see the sketch below)
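For context on that heuristic: FPN (which Mask R-CNN uses) assigns a w x h proposal to pyramid level k = floor(k0 + log2(sqrt(w*h)/224)) with k0 = 4. A minimal Python sketch (the function name and the P2..P5 clamp are illustrative) shows how two similar proposals can land on different levels:

```python
import math

def fpn_level(w: float, h: float, k0: int = 4, canonical: float = 224.0) -> int:
    """FPN's heuristic: k = floor(k0 + log2(sqrt(w*h) / canonical))."""
    k = k0 + math.log2(math.sqrt(w * h) / canonical)
    return max(2, min(5, math.floor(k)))  # clamp to available levels P2..P5

# Two square proposals differing by only 10 pixels get different levels:
print(fpn_level(110, 110))  # -> 2
print(fpn_level(120, 120))  # -> 3
```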
Contributions
- PANet is proposed for instance segmentation
- 1) bottom-up path augmentation to shorten the information path and improve the feature pyramid with accurate low-level information => new: propagating low-level features to enhance the feature hierarchy for instance recognition
- 2) adaptive feature pooling is introduced to recover the broken information path between each proposal and all feature levels
- 3) for multiple views: mask prediction is augmented with small fc layers => more diverse information, masks of better quality
- 1) & 2) are used for both detection and segmentation and lead to improvements on both tasks
Framework
Bottom-up Path Augmentation
- Intuition: a bottom-up path is added so that features from lower layers propagate easily to the top
- We know that high layers respond to whole objects, while lower layers respond to fine features such as edges and textures
- FPN already enhances the hierarchy with a top-down path and lateral connections that propagate semantically strong features
- Here a path from low levels to higher ones is built in addition: strong responses to edges and instance parts live in low layers and are a strong cue for localization, so propagating them upward reduces localization error
- The approach follows FPN with a ResNet backbone: layers producing feature maps of the same spatial size belong to the same stage (see (b) in Figure 1)
- As shown in Figure 2, each building block takes the higher-resolution map N_i and the coarser FPN map P_{i+1} and generates a new map N_{i+1}: N_i first goes through a 3x3 conv with stride 2 to reduce its size, then each element of P_{i+1} and the down-sampled map are added through the lateral connection. The fused map is processed by another 3x3 conv to generate N_{i+1}; the process is iterated until P_5 is approached. All convs are followed by a ReLU (a minimal sketch of one block follows after this list)
- up to 0.9 mAP improvement with large input sizes
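A minimal PyTorch sketch of one such building block (class and argument names are illustrative; 256 channels assumed, as in FPN; ReLU placement follows the paper's "every conv is followed by a ReLU"):

```python
import torch.nn as nn
import torch.nn.functional as F

class BottomUpBlock(nn.Module):
    """One block of the bottom-up path: (N_i, P_{i+1}) -> N_{i+1}."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # 3x3 conv, stride 2: reduces N_i to the spatial size of P_{i+1}
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        # 3x3 conv that turns the fused map into N_{i+1}
        self.fuse = nn.Conv2d(channels, channels, 3, stride=1, padding=1)

    def forward(self, n_i, p_next):
        x = F.relu(self.down(n_i))   # downsample N_i
        x = x + p_next               # element-wise add via lateral connection
        return F.relu(self.fuse(x))  # generate N_{i+1}
```

Starting from N_2 = P_2 and applying the block iteratively yields N_3, N_4, N_5.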
Adaptive Feature Pooling
- Idea: adaptive feature pooling allows each proposal to access information from all levels
- In FPN, small proposals are assigned to low feature levels and large proposals to higher ones. This can be suboptimal: e.g., two proposals with a 10-pixel difference can be assigned to different levels even though they are rather similar, and the importance of features may not correlate strongly with the level they belong to
- High-level features have a larger receptive field and more global context, whereas lower ones have fine details and high localization accuracy. Therefore, features pooled from all levels are fused for each proposal's prediction; the authors call this adaptive feature pooling. A max operation is used for fusion. Computing the ratio of kept features per level shows that, surprisingly, almost 70% come from other, higher levels. The finding: features from multiple levels together are helpful for accurate prediction, an intuition similar to DenseNet. This also supports the bottom-up augmentation
- The exact process is shown in Figure 1 (c): first each proposal is mapped to the different feature levels, then ROIAlign pools a grid from each level, and finally the grids from the different levels are fused with an element-wise max or sum (see the sketch below). The focus is on the in-network feature hierarchy instead of different levels of an image pyramid; the design is comparable to approaches that use L2 normalization, concatenation, and dimension reduction
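A minimal sketch of the pooling-and-fusion step using torchvision's ROIAlign (function name, stride list, and output size are illustrative; in the paper, fusion actually happens after the first layer of the following sub-network, while this sketch fuses right after pooling for brevity):

```python
import torch
from torchvision.ops import roi_align

def adaptive_feature_pool(feature_maps, boxes, strides, output_size=7):
    """Pool every proposal from every pyramid level, then fuse with a max.

    feature_maps: list of [B, C, H_l, W_l] tensors, one per level
    boxes: list of length B with [K_b, 4] proposal boxes in image coords
    strides: downsampling factor of each level w.r.t. the input image
    """
    pooled_per_level = []
    for fmap, stride in zip(feature_maps, strides):
        # spatial_scale maps image coordinates onto this level's grid
        pooled_per_level.append(
            roi_align(fmap, boxes, output_size=output_size,
                      spatial_scale=1.0 / stride, sampling_ratio=2))
    # element-wise max across levels (the paper also considers sum)
    return torch.stack(pooled_per_level).max(dim=0).values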
Fully Connected Fusion
- Results have shown that an MLP is also useful for pixel-wise mask prediction. fc layers have different properties than an FCN: the FCN shares parameters and predicts based on a local receptive field, while fc layers are location-sensitive since they do not share parameters and can adapt to different spatial locations. Using both makes it easier to differentiate parts within an instance and to exploit this information by fusing the two predictions
- The mask branch operates on pooled features and mainly consists of an FCN with 4 convs (each 3x3 with 256 filters) and 1 deconv layer (upsampling by 2). A shortcut from conv3 to an fc layer is added; the fc layer predicts a class-agnostic foreground/background mask (a sketch follows after this list). It is efficient and allows for better generality
- up to 0.7 mAP improvement for all scales
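A simplified PyTorch sketch of the fused mask branch (channel counts, pooled size, and class count are illustrative; the paper's fc path also includes extra convs that are omitted here):

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskBranchFCFusion(nn.Module):
    """FCN mask path fused with a small fc path (class-agnostic fg/bg mask)."""

    def __init__(self, in_ch=256, num_classes=80, pool=14):
        super().__init__()
        self.pool = pool
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch if i == 0 else 256, 256, 3, padding=1)
             for i in range(4)])
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)  # upsample x2
        self.fcn_out = nn.Conv2d(256, num_classes, 1)            # per-class masks
        # fc layer fed by the shortcut from conv3; one (2*pool)^2 fg/bg mask
        self.fc = nn.Linear(256 * pool * pool, (2 * pool) ** 2)

    def forward(self, x):
        fc_in = None
        for i, conv in enumerate(self.convs):
            x = F.relu(conv(x))
            if i == 2:
                fc_in = x  # shortcut from conv3 into the fc path
        fcn_mask = self.fcn_out(F.relu(self.deconv(x)))  # [N, C, 2p, 2p]
        fg = self.fc(fc_in.flatten(1))
        fg = fg.view(-1, 1, 2 * self.pool, 2 * self.pool)
        return fcn_mask + fg  # broadcast-add the class-agnostic mask
```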
Others
- multi-GPU synchronized batch normalization
- heavier head: effective for box AP
- multi-scale training
Total improvement in AP is 4.4 over the baseline; about half of it is due to synchronized BN and multi-scale training
Results
- An ablation study was conducted on the architecture design
- Winner of the 2017 COCO Instance Segmentation Challenge; state-of-the-art performance in segmentation (Cityscapes) and detection at the time