
A Detailed Explanation of the Fast-SCNN Semantic Segmentation Network

2021-09-15 04:07:23 AI bacteria

1. Brief Introduction

This paper was published at BMVC 2019 and jointly completed by Rudra and Stephan of Toshiba Research and Roberto of the University of Cambridge. Its highlight is a fast semantic segmentation network, Fast-SCNN, which on high-resolution (1024×2048) images reaches 68.0% mIoU on the Cityscapes dataset at a speed of 123.5 frames per second, as measured on an NVIDIA Titan XP GPU.


2. Main Contributions

The main contributions of this paper are as follows:

  • It adapts the skip (shortcut) connection into a shallow learning-to-downsample module, which extracts low-level features quickly and efficiently across multiple branches.
  • It verifies that large-scale pre-training is not necessary: increasing the number of training epochs can achieve a comparable effect.
  • It combines the classic encoder-decoder framework with the multi-branch framework, proposing a new real-time semantic segmentation architecture, Fast-SCNN.

3. Related Background

Semantic segmentation is usually handled by deep convolutional neural networks (DCNNs) with an encoder-decoder framework, while many runtime-efficient implementations adopt a two-branch or multi-branch architecture.

Usually, when designing a semantic segmentation network architecture, the following aspects need attention:

  • A larger receptive field is important for understanding the complex associations between object classes (i.e., the global context).
  • Spatial detail in the image is necessary to preserve object boundaries.
  • Specific designs are needed to balance speed and accuracy.

(1) Encoder-decoder architecture

The encoder extracts deep convolutional features with convolution and pooling operations; the decoder recovers spatial information from the low-resolution features and then predicts per-pixel object labels through a pixel classification layer (softmax). The encoder is usually built on networks such as VGG or ResNet, while the decoder is constructed from upsampling modules.

Below is the classic encoder-decoder architecture. Early semantic segmentation networks such as FCN, SegNet, and U-Net adopt this structure.

[Figure: classic encoder-decoder architecture]

(2) Multi-branch structure

In a two-branch network, a deeper CNN operates on a low-resolution input branch to capture the global context, while a shallow branch operates on the full-resolution input to learn spatial details; the final semantic segmentation result is produced by merging the two. This design reduces the computational cost of the network so that it can run in real time on an ordinary GPU.
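As a minimal NumPy sketch of the merge step (the function name and the simple addition-based fusion are illustrative assumptions, not taken from any specific paper):

```python
import numpy as np

def two_branch_fuse(detail, context):
    """Merge a full-resolution detail map with a low-resolution context map.

    detail:  (H, W, C) features from the shallow full-resolution branch
    context: (H/s, W/s, C) features from the deep low-resolution branch
    The context map is upsampled (nearest-neighbor) and added element-wise.
    """
    sh = detail.shape[0] // context.shape[0]
    sw = detail.shape[1] // context.shape[1]
    up = np.repeat(np.repeat(context, sh, axis=0), sw, axis=1)
    return detail + up

# Toy example: a 1/4-resolution context branch fused with the detail branch.
detail = np.ones((8, 8, 16))
context = np.ones((2, 2, 16))
fused = two_branch_fuse(detail, context)
print(fused.shape)  # (8, 8, 16)
```

Real networks typically fuse with a learned convolution rather than raw addition, but the shape bookkeeping is the same.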

Below are classic multi-branch structures. Semantic segmentation networks using this architecture generally have high real-time performance, for example ICNet, ContextNet, BiSeNet, and GUN.

[Figure: classic multi-branch architecture]

(3) Fast-SCNN

The Fast-SCNN proposed in this paper is a real-time semantic segmentation algorithm that combines the classic encoder-decoder framework with the multi-branch framework.

[Figure: Fast-SCNN as a combination of the two frameworks]

4. Network Architecture

The overall network architecture of Fast-SCNN is shown below. It consists of four parts: a learning-to-downsample module, a global feature extractor, a feature fusion module, and a standard classifier. All modules are built with depthwise separable convolutions.

[Figure: Fast-SCNN overall network architecture]

(1) Learning-to-downsample module

The learning-to-downsample module adopts a three-layer structure. Only three layers are used to ensure that low-level feature sharing remains valid and efficient. The first layer is a standard convolution layer (Conv2D); the other two are depthwise separable convolution layers (DSConv). It is worth emphasizing that although DSConv is more computationally efficient, the first layer still uses Conv2D, because the input image has only three channels, which makes the computational advantage of DSConv insignificant at this stage.

All three layers in the learning-to-downsample module use a stride of 2 and a 3×3 convolution kernel, and each convolution layer is followed by batch normalization and a ReLU activation.
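A back-of-the-envelope cost comparison makes the DSConv trade-off concrete (multiply-accumulates per output position, ignoring bias and batch-norm terms; the channel counts below are illustrative, not the paper's exact layer widths):

```python
def conv2d_macs(c_in, c_out, k=3):
    # Standard convolution: every output channel sees all input channels.
    return k * k * c_in * c_out

def dsconv_macs(c_in, c_out, k=3):
    # Depthwise k x k per input channel, then a 1x1 pointwise projection.
    return k * k * c_in + c_in * c_out

# With a 3-channel RGB input, the standard conv is already cheap in
# absolute terms, so switching to DSConv saves little overall.
print(conv2d_macs(3, 32), dsconv_macs(3, 32))      # 864 123
# With wide feature maps deeper in the network, DSConv is roughly
# k*k times cheaper, which is where the savings actually matter.
print(conv2d_macs(64, 128), dsconv_macs(64, 128))  # 73728 8768
```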

(2) Global feature extractor

The global feature extractor module is designed to capture the global context for image segmentation. Unlike ordinary two-branch methods, which operate on a low-resolution version of the input image, this module directly takes the output of the learning-to-downsample module (at 1/8 the resolution of the original input) as its input.

[Figure: global feature extractor]

The detailed structure of the module is shown in Table 1. It uses the efficient bottleneck residual block introduced by MobileNet-V2 (Table 2); in particular, a residual connection is added to the bottleneck block when its input and output sizes are the same. The bottleneck residual block uses efficient depthwise separable convolutions, which reduces the number of parameters and floating-point operations. In addition, a pyramid pooling module (PPM) is appended at the end to aggregate context information from different regions.
[Tables 1 and 2: global feature extractor layers and bottleneck residual block structure]
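A rough parameter count shows where the depthwise step saves work inside a MobileNet-V2 style bottleneck (expansion factor, channel counts, and the t = 6 default are illustrative; batch-norm and bias parameters are ignored):

```python
def bottleneck_params(c_in, c_out, t=6, k=3):
    """Weights in a MobileNet-V2 style bottleneck residual block:
    1x1 expand -> k x k depthwise -> 1x1 project."""
    c_mid = t * c_in
    expand = c_in * c_mid       # 1x1 expansion convolution
    depthwise = k * k * c_mid   # k x k depthwise convolution
    project = c_mid * c_out     # 1x1 projection convolution
    return expand + depthwise + project

# 64 -> 64 channels with expansion factor t = 6:
print(bottleneck_params(64, 64))    # 52608
# The same block if the depthwise step were a full 3x3 convolution
# over all 384 expanded channels:
print(64*384 + 9*384*384 + 384*64)  # 1376256
```

The depthwise convolution is what keeps the expanded intermediate representation affordable; a full convolution at that width would dominate the block's cost by more than an order of magnitude.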

(3) Feature fusion module

Similar to ICNet and ContextNet, the features of the different branches are simply merged, which keeps the module efficient. Alternatively, a more complex feature fusion module (for example, the one in BiSeNet) could be used at the cost of runtime performance to achieve higher accuracy. The details of the feature fusion module are shown in Table 3:
[Table 3: feature fusion module details]

(4) Standard classifier

The classifier uses two depthwise separable convolutions (DSConv) and one pointwise convolution (Conv2D). We found that adding these few layers after the feature fusion module improves accuracy. The details of the classifier module are shown in Table 1.
[Table: classifier module details]

Softmax is used during training because training relies on gradient descent. During inference, the expensive softmax computation can be replaced with argmax, since softmax is monotonically increasing and therefore preserves the ordering of the logits. This option is denoted Fast-SCNN cls (classification). On the other hand, if a probabilistic model in the style of a standard DCNN is required, softmax is used, denoted Fast-SCNN prob (probability).
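The equivalence is easy to verify numerically: because softmax preserves the ordering of the logits, taking argmax before or after it yields the same per-pixel label (the toy shapes below are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=-1, keepdims=True)

# Per-pixel class logits for a toy 2x2 image with 4 classes.
logits = np.random.randn(2, 2, 4)

# The predicted label is unchanged if softmax is skipped at inference.
pred_soft = softmax(logits).argmax(axis=-1)
pred_fast = logits.argmax(axis=-1)
print(bool((pred_soft == pred_fast).all()))  # True
```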

5. Experimental Results

(1) Experimental environment and parameter settings

The experiments are carried out in Python on the TensorFlow machine learning platform, on workstations with an NVIDIA Titan X (Maxwell) or NVIDIA Titan Xp (Pascal) GPU, CUDA 9.0, and cuDNN v7. Runtime evaluation is performed on a single CPU thread and one GPU to measure forward inference time.

Stochastic gradient descent (SGD) is used with momentum 0.9 and a batch size of 12. Inspired by [4, 37, 10], a poly learning rate schedule is used with a base learning rate of 0.045 and a power of 0.9. Similar to MobileNet-V2, we find that depthwise convolutions do not require L2 regularization; for the other layers the L2 regularization parameter is 0.00004.
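The poly schedule decays the learning rate smoothly from the base value to zero over training. A minimal sketch (the function name and step count are illustrative):

```python
def poly_lr(step, max_steps, base_lr=0.045, power=0.9):
    """'Poly' learning-rate schedule: base_lr * (1 - step/max_steps)^power."""
    return base_lr * (1 - step / max_steps) ** power

print(poly_lr(0, 1000))     # 0.045 (start of training)
print(poly_lr(500, 1000))   # ~0.0241 (halfway, slightly above linear decay)
print(poly_lr(1000, 1000))  # 0.0 (end of training)
```

With power below 1 the curve stays above a straight line, keeping the learning rate relatively high for most of training before dropping off near the end.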

Because training data for semantic segmentation is limited, various data augmentation techniques are used in the experiments: random resizing between 0.5 and 2, translation/cropping, horizontal flipping, color channel noise, and brightness changes. The model is trained with cross-entropy loss. We find that an auxiliary loss with weight 0.4 at the end of the global feature extraction module is beneficial.
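Two of these augmentations can be sketched in a few lines of NumPy (the nearest-neighbor resize and the function name are illustrative simplifications; the key point is that image and label must be transformed identically so they stay aligned):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, label):
    """Random resizing in [0.5, 2] plus random horizontal flip."""
    scale = rng.uniform(0.5, 2.0)
    h = max(1, int(image.shape[0] * scale))
    w = max(1, int(image.shape[1] * scale))
    # Nearest-neighbor index maps, shared by image and label.
    rows = (np.arange(h) / scale).astype(int).clip(0, image.shape[0] - 1)
    cols = (np.arange(w) / scale).astype(int).clip(0, image.shape[1] - 1)
    image = image[rows][:, cols]
    label = label[rows][:, cols]
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1]
        label = label[:, ::-1]
    return image, label

img = np.zeros((64, 128, 3))
lab = np.zeros((64, 128), dtype=int)
aug_img, aug_lab = augment(img, lab)
print(aug_img.shape[:2] == aug_lab.shape)  # True
```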

Batch normalization is applied before each nonlinearity. Dropout is used only on the last layer, right before the softmax layer. Contrary to MobileNet and ContextNet, we find that Fast-SCNN trains faster with ReLU and reaches slightly higher accuracy than with ReLU6, even with the depthwise separable convolutions used throughout the model.

(2) Experimental results on Cityscapes

Cityscapes is the largest publicly available dataset of urban road scenes. It contains high-resolution images (1024×2048 px) captured in 50 different European cities, of which 5000 have high-quality labels: a training set of 2975 images, a validation set of 500 images, and a test set of 1525 images.

Overall performance is evaluated on the Cityscapes test set and compared against other real-time semantic segmentation methods (ContextNet, BiSeNet, GUN, ENet, and ICNet) as well as offline methods (PSPNet, DeepLab-V2). The mean accuracy (mIoU) comparison is shown below:
[Table: mIoU comparison on the Cityscapes test set]
Inference speed, tested on NVIDIA Titan X and Titan Xp (marked with *), compared across different segmentation networks:
[Table: runtime comparison across segmentation networks]

(3) Test performance with weak labels

Several groups of experiments were carried out using the weakly labeled Coarse set and ImageNet pre-training. The results are compared in the table below:
[Table: results with Coarse weak labels and ImageNet pre-training]
The experimental results show that adding weakly labeled data to the training of a low-capacity DCNN brings no significant improvement.

(4) Large-scale pre-training test results

The figure below shows accuracy over the course of training when using the weakly labeled Coarse set or ImageNet pre-training:
[Figure: accuracy curves with and without Coarse labels / ImageNet pre-training]
As the figure shows, whether the weakly labeled Coarse set or ImageNet pre-training is used, the final accuracy converges to the same level. This experiment therefore shows that Coarse weak labels and ImageNet pre-training do not help the low-capacity Fast-SCNN much; indeed, as long as training runs long enough, a model trained without Coarse or ImageNet achieves the same effect.


The best relationship is one of mutual achievement. Everyone's "triple-like" is the greatest motivation for [AI bacteria]'s creation. See you next time!


Copyright notice
This article was created by [AI bacteria]. Please include the original link when reprinting. Thank you.
https://chowdera.com/2021/09/20210909111002661s.html
