Semantic Segmentation using Fully Convolutional Networks over the years

Introduction

Semantic segmentation assigns a semantic class to each pixel of an input image, producing a dense, pixel-wise classification. While semantic segmentation / scene parsing has been studied in the computer vision community since 2007, a major breakthrough came when Long et al. (2014) first used fully convolutional neural networks to perform end-to-end segmentation of natural images.

Figure: Example of semantic segmentation (Left) generated by FCN-8s overlaid on the input image (Right)

The FCN-8s architecture achieved a 20% relative improvement over prior work, reaching 62.2% mean IU (mean intersection over union) on the Pascal VOC 2012 dataset. It became the baseline for semantic segmentation on top of which several newer and better architectures were developed.
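Mean IU is the intersection over union between predicted and ground-truth pixels, averaged over classes. The sketch below computes it for a single pair of label maps with NumPy. The function name `mean_iu` and the toy arrays are made up for illustration, and benchmark code usually accumulates a confusion matrix over the whole dataset rather than scoring one image at a time.

```python
import numpy as np

def mean_iu(pred, target, num_classes):
    """Mean intersection over union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps, so it is skipped
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 3x3 label maps with three classes (illustrative values only)
target = np.array([[0, 0, 1],
                   [0, 1, 1],
                   [2, 2, 2]])
pred = np.array([[0, 0, 1],
                 [0, 0, 1],
                 [2, 2, 1]])
score = mean_iu(pred, target, num_classes=3)
```

Here the per-class IoUs are 3/4, 2/4 and 2/3, so `score` is about 0.639.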

Fully Convolutional Networks (FCNs) are used for semantic segmentation of natural images, for multi-modal medical image analysis and for multispectral satellite image segmentation. As with deep classification networks such as AlexNet, VGG and ResNet, there is a large variety of deep architectures that perform semantic segmentation.

I summarize networks like FCN, SegNet, U-Net, FC-DenseNet, E-Net & Link-Net, RefineNet, PSPNet, Mask R-CNN, and some semi-supervised approaches like DecoupledNet and GAN-SS here and provide reference PyTorch and Keras implementations for a number of them.

Network Architectures

A general semantic segmentation architecture can be broadly thought of as an encoder network followed by a decoder network. The encoder is usually a pre-trained classification network like VGG/ResNet. The decoder network/mechanism is mostly where these architectures differ. The task of the decoder is to semantically project the discriminative features (lower resolution) learnt by the encoder onto the pixel space (higher resolution) to get a dense classification.
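To make the encoder/decoder split concrete, here is a deliberately tiny NumPy sketch of the data flow. It is not a real network: the encoder is plain average pooling, the classifier is a random 1x1 projection, the decoder is nearest-neighbour upsampling, and all names and sizes are hypothetical. Only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image, factor=4):
    """Toy encoder: average-pool by `factor`, trading resolution for context."""
    c, h, w = image.shape
    return image.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def classify(features, weights):
    """1x1 convolution: a per-location linear map from channels to class scores."""
    return np.einsum("kc,chw->khw", weights, features)

def decode(scores, factor=4):
    """Toy decoder: nearest-neighbour upsampling back to the input resolution."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)

image = rng.random((3, 32, 32))               # channels, height, width
weights = rng.standard_normal((5, 3))         # 5 hypothetical classes
features = encode(image)                      # (3, 8, 8): lower resolution
coarse_scores = classify(features, weights)   # (5, 8, 8): class scores per location
dense_scores = decode(coarse_scores)          # (5, 32, 32): back in pixel space
label_map = dense_scores.argmax(axis=0)       # (32, 32): one class index per pixel
```

The architectures below mostly replace the naive `decode` step with something that recovers more spatial detail.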

Available implementations:

A more formal summarization of semantic segmentation (including recurrent style networks) can also be found here.

Fully Convolutional Networks (FCNs)

CVPR 2015 · arXiv
We adapt contemporary classification networks (AlexNet, VGG net, GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.

FCN architecture

FCN transforming FC layers to convolutions

Key features:

FCN-32s Architecture

The fully connected layers (fc6, fc7) of classification networks like VGG16 are converted to convolutional layers. This produces a low-resolution class presence heatmap, which is then upsampled using bilinearly initialized deconvolutions. At each stage of upsampling, the heatmap is further refined by fusing it with semantically coarser but higher-resolution feature maps from lower layers in VGG16 (conv4 and conv3).
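The conversion of fully connected layers to convolutions described above can be checked numerically. In this sketch (toy sizes, hypothetical names), the weights of a fully connected layer that expects a C x k x k input are reshaped into N convolution kernels of the same size. Sliding them over a larger input yields a heatmap whose every position equals the fully connected output for the corresponding window.

```python
import numpy as np

rng = np.random.default_rng(0)
C, k, N = 4, 3, 5                       # input channels, window size, output units
fc_weight = rng.standard_normal((N, C * k * k))
fc_bias = rng.standard_normal(N)

def fc(patch):
    """Fully connected layer applied to one flattened C x k x k patch."""
    return fc_weight @ patch.reshape(-1) + fc_bias

def fc_as_conv(x):
    """Same weights viewed as N kernels of shape C x k x k, slid over x (valid, stride 1)."""
    kernels = fc_weight.reshape(N, C, k, k)
    _, H, W = x.shape
    out = np.empty((N, H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            window = x[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(kernels, window, axes=3) + fc_bias
    return out

x = rng.standard_normal((C, 6, 7))      # larger than the window the FC layer expects
heatmap = fc_as_conv(x)                 # (N, 4, 5): one FC output per window position
```

For example, `heatmap[:, 1, 2]` matches `fc(x[:, 1:4, 2:5])`, which is why the converted network can take inputs of arbitrary size.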

In conventional classification CNNs, pooling is used to increase the field of view and at the same time reduce the feature map resolution. While this works well for classification, any operation that reduces spatial resolution is detrimental for semantic segmentation, because spatial information is lost. Most architectures differ mainly in the mechanism the decoder uses to recover the information lost in the encoder.

Deconvolution Dilated Convolution

Other important design choices are the mechanism used for feature upsampling, such as learned deconvolutions, and whether to partially avoid reducing the resolution in the encoder by using dilated convolutions, at the cost of additional computation.
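A one-dimensional sketch shows why dilated convolutions are attractive: the output keeps the input length, while the receptive field grows with the dilation rate. The helper names are hypothetical and the receptive-field formula assumes stacked stride-1 layers.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Zero-padded 1-D cross-correlation with `dilation` steps between kernel taps."""
    k = len(kernel)
    span = dilation * (k - 1)                       # distance from first to last tap
    padded = np.pad(x, (span // 2, span - span // 2))
    out = np.zeros(len(x))
    for i in range(len(x)):
        taps = padded[i:i + span + 1:dilation]      # k samples, `dilation` apart
        out[i] = np.dot(taps, kernel)
    return out

def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with the given dilations."""
    return 1 + sum(d * (kernel_size - 1) for d in dilations)

x = np.arange(16, dtype=float)
y = dilated_conv1d(x, np.ones(3), dilation=4)       # same length as x

plain = receptive_field(3, [1, 1, 1])               # three ordinary 3-tap layers
dilated = receptive_field(3, [1, 2, 4])             # same layers with growing dilation
```

Three ordinary layers see 7 input samples, while the dilated stack sees 15 with the same number of weights. The extra computation comes from running every later layer at full resolution instead of on a pooled map.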

SegNet

2015 · arXiv
The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample.

SegNet Architecture

Key features:

Max Unpooling

As shown above, the indices at each max-pooling layer in the encoder are stored and later used to upsample the corresponding feature map in the decoder by unpooling it with those stored indices. While this helps keep the high-frequency information intact, it also misses neighboring information when unpooling from low-resolution feature maps.
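The index bookkeeping can be sketched in NumPy for a single 2-D feature map with non-overlapping windows. The function names are hypothetical. In SegNet the sparse unpooled map is then passed through trainable convolutions to densify it, which this sketch leaves out.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling on a 2-D map; also returns flat argmax positions in x."""
    h, w = x.shape
    windows = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    windows = windows.reshape(h // size, w // size, size * size)
    local = windows.argmax(axis=2)                        # position inside each window
    rows = np.arange(h // size)[:, None] * size + local // size
    cols = np.arange(w // size)[None, :] * size + local % size
    return windows.max(axis=2), rows * w + cols

def max_unpool(pooled, indices, output_shape):
    """Put each pooled value back at its recorded position; all other entries stay zero."""
    out = np.zeros(output_shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [6., 2., 3., 1.]])
pooled, indices = max_pool_with_indices(x)    # pooled is [[4, 5], [6, 7]]
restored = max_unpool(pooled, indices, x.shape)
```

`restored` holds 4, 5, 6 and 7 at their original locations and zeros everywhere else, which illustrates both the preserved boundary positions and the missing neighboring information.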

U-Net

MICCAI 2015 · arXiv
The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method on the ISBI challenge.

U-Net Architecture

U-Net achieved state-of-the-art results on the EM Stacks dataset, which contains only 30 densely annotated medical images, and was later extended to a 3D version, 3D-U-Net. It has found use in several fields including satellite image segmentation and medical image analysis.

Fully Convolutional DenseNet

2016 · arXiv
We extend DenseNets to deal with the problem of semantic segmentation. We achieve state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module nor pretraining.

FC-DenseNet Architecture

Fully Convolutional DenseNet uses a DenseNet as its base encoder and, similar to U-Net, concatenates features from encoder and decoder at each rung.
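A minimal sketch of such a concatenation skip, in NumPy: the decoder features are upsampled to the encoder resolution and the two maps are stacked along the channel axis. Nearest-neighbour upsampling stands in for the learned upsampling layers these networks actually use, and the channel counts are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (channels, height, width) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def concat_skip(decoder_features, encoder_features):
    """Upsample decoder features, then stack encoder features along the channel axis."""
    up = upsample2x(decoder_features)
    assert up.shape[1:] == encoder_features.shape[1:], "spatial sizes must match"
    return np.concatenate([encoder_features, up], axis=0)

encoder_features = np.ones((16, 8, 8))     # hypothetical high-resolution encoder output
decoder_features = np.zeros((32, 4, 4))    # hypothetical low-resolution decoder input
merged = concat_skip(decoder_features, encoder_features)   # shape (48, 8, 8)
```

Concatenation keeps both sources intact but increases the channel count that the following layers have to process.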

E-Net and Link-Net

ENet is up to 18x faster, requires 75x less FLOPs, has 79x less parameters, and provides similar or better accuracy to existing models. LinkNet can process an input image of resolution 1280x720 on TX1 and Titan X at 2 fps and 19 fps respectively.

LinkNet Architecture LinkNet blocks

The LinkNet architecture resembles a ladder network, where feature maps from the encoder (laterals) are summed with the upsampled feature maps from the decoder (verticals). The decoder block has considerably fewer parameters due to its channel reduction scheme.
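The effect of a channel reduction scheme on the parameter count can be worked out with simple arithmetic. The sketch below assumes a bottleneck layout of a 1x1 convolution (m to m/4 channels), a 3x3 transposed convolution (m/4 to m/4) and a 1x1 convolution (m/4 to n). This layout and the channel counts are assumptions for illustration and should be checked against the paper. Biases and batch norm are ignored.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weight count of a (transposed) convolution, ignoring biases and batch norm."""
    return in_ch * out_ch * kernel * kernel

def bottleneck_decoder_params(m, n):
    """Assumed layout: 1x1 (m -> m/4), 3x3 transposed (m/4 -> m/4), 1x1 (m/4 -> n)."""
    mid = m // 4
    return conv_params(m, mid, 1) + conv_params(mid, mid, 3) + conv_params(mid, n, 1)

def direct_decoder_params(m, n):
    """Baseline for comparison: a single 3x3 transposed convolution (m -> n)."""
    return conv_params(m, n, 3)

m, n = 512, 256   # hypothetical input and output channel counts
bottleneck = bottleneck_decoder_params(m, n)   # 245,760 weights
direct = direct_decoder_params(m, n)           # 1,179,648 weights
```

Under these assumptions the bottleneck block needs about 4.8x fewer weights than the direct alternative.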

Mask R-CNN

2017 · arXiv
Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. It is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.

Mask R-CNN Architecture Mask R-CNN pipeline

Key features:

PSPNet

CVPR 2017 · arXiv
We exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet).

PSPNet Architecture

Key features:

At the time of writing, the PSPNet architecture is the state of the art on Cityscapes, ADE20K and Pascal VOC 2012. A full visualization of the network can be found here.

RefineNet

CVPR 2017 · arXiv
RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections.

RefineNet Architecture RefineNet blocks

Key features:

G-FRNet

CVPR 2017 · PDF
Gated Feedback Refinement Network (G-FRNet), an end-to-end deep learning framework for dense labeling tasks. We introduce gate units that control the information passed forward in order to filter out ambiguity.

G-FRNet Architecture Gated Refinement Unit

Most architectures rely on simple feature passing using concatenation, unpooling or summation. However, information that flows from higher resolution layers may or may not be useful for segmentation. Gating the information flow from encoder to decoder using Gated Refinement Feedback Units can help the decoder resolve ambiguities. The experiments in this paper also show that ResNet is a far better encoder than VGG16 for semantic segmentation.
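One simple way to picture gating is an element-wise multiplication of the encoder features by a map derived from deeper features. The sketch below uses a sigmoid gate and equal channel counts. Both are assumptions chosen for illustration, not the exact unit from the paper, which also involves learned convolutions.

```python
import numpy as np

def gated_skip(encoder_features, deeper_features):
    """Scale encoder features element-wise by a gate computed from deeper features."""
    up = deeper_features.repeat(2, axis=1).repeat(2, axis=2)   # match encoder resolution
    gate = 1.0 / (1.0 + np.exp(-up))                            # sigmoid, values in (0, 1)
    return encoder_features * gate

rng = np.random.default_rng(0)
encoder_features = rng.standard_normal((4, 8, 8))   # shallow, high resolution
deeper_features = rng.standard_normal((4, 4, 4))    # deep, low resolution
gated = gated_skip(encoder_features, deeper_features)
```

Unlike concatenation or summation, the gate can suppress encoder activations that the deeper, more semantic features consider irrelevant.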

Semi-Supervised Semantic Segmentation

DecoupledNet

NIPS 2015 · arXiv
Our algorithm decouples classification and segmentation, and learns a separate network for each task. Labels associated with an image are identified by classification network, and binary segmentation is subsequently performed for each identified label.

DecoupledNet Architecture

GAN Based Approaches

2017 · arXiv

Weakly Supervised GAN Semi-Supervised GAN

Datasets

Dataset            Training   Testing   Classes
CamVid                  468       233        11
Pascal VOC 2012       9,963     1,447        20
NYUDv2                  795       645        40
Cityscapes            2,975       500        19
Sun-RGBD             10,355     2,860        37
MS COCO '15          80,000    40,000        80
ADE20K               20,210     2,000       150

Results

Sample semantic segmentation maps generated by FCN-8s (trained using the pytorch-semseg repository) overlaid on input images from Pascal VOC validation set:

Figure: Six input images (Input 0 to Input 5), each paired with its segmentation map (Segmentation 0 to Segmentation 5)

In case this doesn't work for you, or if there is a mistake/typo, please open an issue in the repo.