
Inception Series: Inception_v2 and Inception_v3


Previously in this series: Inception_v1

Previously in this series: Batch-Normalization

Introduction

    Inception_v2 and Inception_v3 come from the same paper; the network that introduced BN is not Inception_v2. The distinction between the two is this: 《Rethinking the Inception Architecture for Computer Vision》 proposes a number of design and improvement techniques, and a network that uses only some of these structures and improvements is called Inception_v2, while the one that uses all of them is Inception_v3.

 

Model design principles

 

    Because of the complexity of the Inception_v1 structure, it is hard to build on it: changing the structure arbitrarily can easily wipe out part of its computational gains. Moreover, the Inception_v1 paper does not describe in detail the factors behind each design decision, which makes it difficult to adapt the network to new applications. For these reasons, the Inception_v2 paper lays out the following basic design principles and proposes several new structures based on them.

 

    1. Avoid representational bottlenecks, especially in the shallow layers of the network. The representation size of a feedforward network should decrease gradually from input to output. (A bottleneck occurs when the size does not change in this way.)

 

    2. Higher-dimensional representations are easier to process within the network. Increasing the number of activations makes features easier to disentangle and also makes the network train faster. (This principle means that the higher the dimension, the better suited the data is to being processed by a network; classifying data on a two-dimensional plane, for example, is not well suited to a network. Increasing the number of activations makes it easier for the network to learn the representation.)

 

    3. Spatial aggregation can be performed over lower-dimensional embeddings without losing much representational power. For example, before performing a more spread-out (e.g. 3×3) convolution, the dimension of the input representation can be reduced prior to the spatial aggregation without serious adverse effects. We hypothesize that the reason is that, if the outputs are used in a spatial-aggregation context, the strong correlation between adjacent units leads to much less information loss during dimension reduction. Since these signals should be easy to compress, reducing the dimension can even promote faster learning.

 

    4. Balance the width and depth of the network. Optimal performance is reached by balancing the number of filters per stage against the depth of the network. Increasing both width and depth helps improve network quality, but the optimal improvement for a constant amount of computation is achieved when both are increased in parallel. The computational budget should therefore be distributed in a balanced way between the depth and the width of the network.

 

Some special structures

01   Convolution decomposition

    A 5x5 convolution can be replaced by two consecutive 3x3 convolutions: the first is an ordinary 3x3 convolution, and the second is a fully connected 3x3 convolution on top of the first. This preserves the receptive field that the 5x5 convolution would have had, while shrinking the parameter count to 2×9/25 of the original, a reduction of roughly 28%, as shown in fig1. Going further, asymmetric factorization decomposes a 3x3 convolution into a 3x1 followed by a 1x3, as shown on the right in fig2.
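    Below is a minimal PyTorch sketch of the two factorizations described above. It is illustrative only: the module names (`Factorized5x5`, `Asymmetric3x3`) and channel counts are assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class Factorized5x5(nn.Module):
    """Two stacked 3x3 convolutions covering the same 5x5 receptive field.

    When the input and output channel counts match, the 25 weights of a 5x5
    kernel are replaced by 2x9 = 18, i.e. roughly the 28% reduction cited above.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)

class Asymmetric3x3(nn.Module):
    """A 3x3 convolution decomposed into a 3x1 followed by a 1x3."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)

# Both modules preserve the spatial size of a 32x32 feature map.
x = torch.randn(1, 64, 32, 32)
print(Factorized5x5(64, 96)(x).shape)   # torch.Size([1, 96, 32, 32])
print(Asymmetric3x3(64, 96)(x).shape)   # torch.Size([1, 96, 32, 32])
```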

 

 

    The original Inception module (fig3, left) can thus be transformed into the structures shown in fig5 (middle) and fig6 (right).

 

 

 

     Finally, the structure shown in fig7, which mixes the two ways of factorization, is derived.

     In practice, such a factorized structure does not work well in the lower layers of the network. It performs better in layers of medium grid size (m×m feature maps with m in the range of 12 to 20). Taking the second principle into account, these Inception modules are therefore placed in the middle of the network, while ordinary convolutional layers are kept in the lower layers. A module-level sketch is given below.
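    The sketch below shows how the asymmetric 1xn / nx1 factorization can be arranged inside one Inception-style module, in the spirit of fig6. The helper name `conv_bn_relu`, the branch widths, and the channel counts are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, **kwargs):
    """Convolution followed by BN and ReLU (illustrative helper)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, bias=False, **kwargs),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FactorizedInception(nn.Module):
    """Inception module with 1xn / nx1 factorized branches (fig6-style sketch)."""
    def __init__(self, in_ch, n=7):
        super().__init__()
        pad = n // 2
        self.branch1 = conv_bn_relu(in_ch, 64, kernel_size=1)
        self.branch2 = nn.Sequential(
            conv_bn_relu(in_ch, 64, kernel_size=1),
            conv_bn_relu(64, 64, kernel_size=(1, n), padding=(0, pad)),
            conv_bn_relu(64, 64, kernel_size=(n, 1), padding=(pad, 0)),
        )
        self.branch3 = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            conv_bn_relu(in_ch, 64, kernel_size=1),
        )

    def forward(self, x):
        # Branch outputs are concatenated along the channel dimension.
        return torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)

# Example on a 17x17 grid, the medium size range where the text says
# factorization works best.
x = torch.randn(1, 256, 17, 17)
print(FactorizedInception(256)(x).shape)  # torch.Size([1, 192, 17, 17])
```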

 

02   The utility of auxiliary classifiers

     Auxiliary classifiers do not help much in the early stage of training; late in training, the network with them begins to surpass the network without auxiliary classifiers and reaches a slightly higher plateau. Moreover, removing these two auxiliary classifiers has no adverse effect, so the idea in Inception_v1 that they help the lower layers of the network train faster is questionable. If these branches are given BN or Dropout, the main classifier performs better, which is weak evidence that BN can also act as a regularizer.
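     For concreteness, here is a rough sketch of an auxiliary classifier head with BN in the branch, as discussed above. The layer sizes, the loss weight, and the name `AuxClassifier` are assumptions for illustration, not the paper's exact head.

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Side classifier attached to an intermediate feature map (sketch).

    BatchNorm (and Dropout) in this branch is what the text cites as weak
    evidence that BN also acts as a regularizer.
    """
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(5),            # pool the feature map to 5x5
            nn.Conv2d(in_ch, 128, kernel_size=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(128 * 5 * 5, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        return self.head(x)

# Features from a middle stage of the network; sizes are illustrative.
x = torch.randn(2, 768, 17, 17)
print(AuxClassifier(768, 1000)(x).shape)  # torch.Size([2, 1000])

# During training the auxiliary loss is typically added to the main loss
# with a small weight, e.g.  loss = main_loss + 0.3 * aux_loss
```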

 

03   Efficient grid size reduction

     There are two ways of reducing the grid size, shown in the figure above. The one on the left violates the first principle: the representation size should decrease gradually layer by layer, otherwise a bottleneck appears. The one on the right conforms to the first principle but is much more expensive to compute. The authors therefore propose the new approach shown in fig10: a stride-2 pooling operation and a stride-2 convolution are run in parallel, reducing the grid size without violating the first principle.
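     A minimal PyTorch sketch of this parallel reduction idea follows. The class name `ReductionBlock`, the channel counts, and the padding choice are assumptions; the paper's exact variant differs in such details.

```python
import torch
import torch.nn as nn

class ReductionBlock(nn.Module):
    """Parallel stride-2 branches concatenated along channels (fig10-style sketch).

    The grid is halved by a convolutional branch and a pooling branch in
    parallel, so the representation shrinks without first squeezing channels
    (no bottleneck) and without the full cost of convolving at the larger
    resolution.
    """
    def __init__(self, in_ch, conv_ch):
        super().__init__()
        # Convolutional branch: stride-2 3x3 convolution.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Pooling branch: stride-2 max pooling, channel count unchanged.
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

# Example: a 35x35 grid is reduced while the channel count grows.
x = torch.randn(1, 288, 35, 35)
print(ReductionBlock(288, 384)(x).shape)  # torch.Size([1, 672, 18, 18])
```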

     The complete Inception_v2 structure is shown below:

     No padding is used anywhere in the structure, and the structure proposed in fig10 is used between the Inception modules in the middle of the network.

 

04   Model regularization via label smoothing

     If the model is trained so that all of the probability mass is assigned to the ground-truth label, or so that the largest logit is pushed as far as possible from the others, then intuitively the model becomes over-confident in its predictions; this leads to overfitting and does not guarantee generalization. The labels therefore need to be smoothed.

     Here δ(k, y) is a Dirac delta: it equals 1 when the category k = y, and 0 otherwise. The original label distribution is q(k|x) = δ(k, y). After label smoothing, the label distribution becomes

     q'(k|x) = (1 − ε) δ(k, y) + ε u(k)

     Here ε is a hyperparameter, and u(k) is taken to be 1/K, where K is the number of categories. With three categories, the new label vector becomes (ε/3, ε/3, 1 − 2ε/3), whereas the original label vector was (0, 0, 1).
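     A small NumPy sketch of this formula follows; the function name `smooth_labels` is an assumption for illustration, and ε = 0.1 is just an example value.

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Label smoothing: q'(k) = (1 - eps) * delta(k, y) + eps * u(k), with u(k) = 1/K."""
    one_hot = np.eye(num_classes)[y]                   # original delta(k, y) vector
    uniform = np.full(num_classes, 1.0 / num_classes)  # u(k)
    return (1.0 - eps) * one_hot + eps * uniform

# Three-class example from the text: the ground-truth class is index 2.
print(smooth_labels(2, num_classes=3, eps=0.1))
# [0.0333 0.0333 0.9333]  i.e. (eps/3, eps/3, 1 - 2*eps/3) instead of (0, 0, 1)
```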

Conclusion

 

 

     The actual results are shown in the figure. To clarify the difference between Inception_v2 and Inception_v3 here: Inception_v2 refers to the network that adds one or more of the Label Smoothing, BN-auxiliary, RMSProp, and factorized-convolution techniques to the Inception modules, while Inception_v3 uses all of these techniques.

 

If there are any mistakes or anything unreasonable, corrections in the comments are welcome.

Welcome to follow the official account "CV Technical Guide", which focuses on interpreting computer vision papers, tracking the latest techniques, and summarizing CV methods.

 
