These five computer vision technologies will refresh your worldview (4)

4--Semantic segmentation

Segmentation is central to computer vision: it divides the whole image into groups of pixels that can then be labeled and classified. Semantic segmentation in particular tries to understand the role of each pixel in the image (is it part of a car, a motorcycle, or some other class?). For example, in the picture above, besides recognizing the people, road, cars, trees, and so on, we must also delineate the boundary of each object. Unlike classification, which assigns one label to the whole image, the model therefore has to make dense, pixel-by-pixel predictions.

As with other computer vision tasks, CNNs have had great success in segmentation. One popular early approach was patch classification with a sliding window, in which each pixel is classified separately using a patch of the image surrounding it. However, this is computationally very inefficient, because the features shared between overlapping patches are not reused.
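To get a feel for how much work the sliding window repeats, the arithmetic can be sketched as follows. This is only a rough count of pixel reads, not of real network cost, and it assumes the image is padded so that every pixel gets a full patch. The function names and the 256 x 256 image with 33 x 33 patches are illustrative choices, not values from any particular method.

```python
def sliding_window_cost(height, width, patch):
    """Pixel reads when every pixel is classified from its own patch x patch window.

    Assumes the image is padded so that border pixels also get a full patch.
    """
    return height * width * patch * patch


def overlap_fraction(patch, stride=1):
    """Fraction of a patch shared with the neighbouring patch `stride` pixels to the right."""
    shared_columns = max(patch - stride, 0)
    return shared_columns / patch


if __name__ == "__main__":
    h, w, p = 256, 256, 33  # illustrative sizes only
    reads_per_pixel = sliding_window_cost(h, w, p) // (h * w)
    print("reads per image pixel:", reads_per_pixel)
    print("overlap between neighbouring patches:", overlap_fraction(p))
```

With these sizes each image pixel is read over a thousand times, and two neighbouring patches share almost all of their content, which is exactly the redundancy a shared feature map avoids.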

The solution came from the University of California, Berkeley, in the form of the Fully Convolutional Network (FCN), which popularized an end-to-end CNN architecture for dense prediction without any fully connected layers. This allows segmentation maps to be generated for images of any size, and it is much faster than the patch classification approach. Almost all subsequent semantic segmentation methods adopt this paradigm.
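Why dropping the fully connected layers frees the network from a fixed input size can be seen in a toy sketch. A 1x1 convolution applies the same small classifier at every pixel, so the output always has the same height and width as the input. The code below is a pure-Python illustration under that assumption, not the FCN architecture itself; `conv1x1`, `label_map`, and the tiny feature maps are hypothetical.

```python
def conv1x1(features, weights):
    """Apply the same linear classifier at every pixel.

    features: H x W x C nested lists (C feature channels per pixel).
    weights:  K x C nested lists (one weight row per class).
    Returns H x W x K class scores.
    """
    return [
        [
            [sum(w * x for w, x in zip(class_w, pixel)) for class_w in weights]
            for pixel in row
        ]
        for row in features
    ]


def label_map(scores):
    """Pick the index of the highest-scoring class at every pixel."""
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]  # two classes, two channels (made up)
    small = [[[0.9, 0.1], [0.2, 0.8]]]  # a 1 x 2 "image"
    large = [[[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]],
             [[0.6, 0.4], [0.5, 0.6], [1.0, 0.0]]]  # a 2 x 3 "image"
    # The same weights work for both sizes; the label map matches the input size.
    print(label_map(conv1x1(small, weights)))
    print(label_map(conv1x1(large, weights)))
```

Because the weights never depend on the image height or width, the same model produces a label for every pixel of any input.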

However, one problem remains: convolutions at the original image resolution are very expensive. To deal with this, the FCN uses downsampling and upsampling inside the network. The downsampling layer is known as strided convolution, and the upsampling layer is known as transposed convolution (often called deconvolution).
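The effect of these two layer types on spatial size can be sketched with the standard output-size formulas for a convolution and a transposed convolution (no dilation and no output padding are assumed here). The kernel sizes, strides, paddings, and the 224-pixel input are illustrative choices, not the settings of the FCN.

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def transposed_conv_out_size(n, kernel, stride=1, padding=0):
    """Output length of a transposed convolution along one spatial dimension."""
    return (n - 1) * stride - 2 * padding + kernel


if __name__ == "__main__":
    size = 224  # illustrative input width
    sizes = [size]
    for _ in range(3):  # three stride-2 convolutions halve the size each time
        size = conv_out_size(size, kernel=3, stride=2, padding=1)
        sizes.append(size)
    for _ in range(3):  # three stride-2 transposed convolutions double it back
        size = transposed_conv_out_size(size, kernel=4, stride=2, padding=1)
        sizes.append(size)
    print(sizes)
    # Positions to convolve over at the lowest resolution vs. the input:
    print((sizes[0] // min(sizes)) ** 2, "times fewer spatial positions")
```

Most of the convolutions then run on a feature map with far fewer positions than the input, which is where the savings come from.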

Despite the upsampling and downsampling layers, the FCN produces coarse segmentation maps because information is lost during pooling. SegNet is a more memory-efficient architecture than FCN that uses max pooling and an encoder-decoder framework. In SegNet, shortcut/skip connections are introduced from higher-resolution feature maps to reduce the coarseness introduced by upsampling and downsampling.
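The information loss during pooling can be seen on a toy binary mask. The sketch below applies 2x2 max pooling and then restores the size with nearest-neighbour upsampling, which stands in here for the learned upsampling of a real network. The mask and function names are hypothetical, and real networks pool feature maps rather than label masks.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 on an H x W grid (H and W must be even)."""
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]


def upsample_nearest_2x(grid):
    """Repeat every value into a 2x2 block (nearest-neighbour upsampling)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


if __name__ == "__main__":
    mask = [  # an object with a diagonal boundary
        [0, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 1],
        [0, 1, 1, 1],
    ]
    restored = upsample_nearest_2x(max_pool_2x2(mask))
    for row in restored:
        print(row)
    changed = sum(a != b for ra, rb in zip(mask, restored) for a, b in zip(ra, rb))
    print("pixels that differ from the original:", changed)
```

The diagonal edge comes back as 2x2 blocks: the position of the boundary inside each pooled cell is gone, which is why passing information from higher-resolution feature maps to the upsampling path helps.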