
CVPR 2021: Huawei Noah's Ark Lab proposes Transformer in Transformer

2021-05-04 16:49:48  CV Technical Guide (official account)

Preface:

Transformers are being applied to images more and more. The mainstream approach splits an image into patches to form a patch sequence, and simply feeds the patches straight into a transformer. However, this ignores the structural information inside each patch. This paper therefore proposes a new transformer model that exploits both the intra-patch and inter-patch sequence information, called Transformer-iN-Transformer, or TNT for short.

 

The main idea

[Figure: overview of the TNT framework]

 

The TNT model splits an image into a sequence of patches, and each patch is reshaped into a sequence of pixels. Linear transformations then produce a patch embedding from each patch and pixel embeddings from its pixels. Both are fed into a stack of TNT blocks for learning; a minimal sketch of this two-level embedding follows below.
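As a rough illustration of this two-level tokenization, here is a minimal PyTorch sketch. All concrete numbers (16x16 patches reduced to a 4x4 grid of pixel tokens, embedding dims 24 and 384) and the class `DualEmbedding` are assumptions for illustration, not the official implementation:

```python
import torch
import torch.nn as nn

class DualEmbedding(nn.Module):
    """Sketch: derive pixel embeddings and patch embeddings from an image."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 pixel_dim=24, patch_dim=384):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Each 16x16 patch becomes a 4x4 grid of "pixel" tokens (assumed setup).
        self.pixel_proj = nn.Conv2d(in_chans, pixel_dim, kernel_size=4, stride=4)
        # Patch embedding = linear map of the flattened pixel tokens.
        self.patch_proj = nn.Linear(4 * 4 * pixel_dim, patch_dim)

    def forward(self, x):                                      # x: (B, 3, 224, 224)
        B, C, H, W = x.shape
        p = self.patch_size
        # Split the image into non-overlapping patches: (B*N, 3, 16, 16).
        patches = x.unfold(2, p, p).unfold(3, p, p)            # (B, 3, 14, 14, 16, 16)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, p, p)
        pixel_embed = self.pixel_proj(patches)                 # (B*N, 24, 4, 4)
        pixel_embed = pixel_embed.flatten(2).transpose(1, 2)   # (B*N, 16, 24)
        patch_embed = self.patch_proj(pixel_embed.flatten(1))  # (B*N, 384)
        patch_embed = patch_embed.reshape(B, self.num_patches, -1)
        return pixel_embed, patch_embed
```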

Each TNT block is composed of an outer transformer block and an inner transformer block.

The outer transformer block models the global dependencies among the patch embeddings, while the inner transformer block models the local structural information among the pixel embeddings within a patch. The local information is fused into the patch embedding by linearly mapping the pixel embeddings into the patch embedding space. To retain spatial information, position encodings are introduced. Finally, the class token is passed through an MLP head for classification. A sketch of one TNT block follows below.
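Here is a compact sketch of one TNT block under the same assumed dimensions (again an illustration of the idea, not the official code; `TransformerBlock` is a standard pre-norm transformer layer):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Standard pre-norm transformer layer: MSA and MLP, each with a residual."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual over MSA
        x = x + self.mlp(self.norm2(x))                     # residual over MLP
        return x

class TNTBlock(nn.Module):
    """Sketch of a TNT block: inner block on pixel tokens, fusion, outer block."""
    def __init__(self, pixel_dim=24, patch_dim=384, num_pixels=16):
        super().__init__()
        self.inner = TransformerBlock(pixel_dim, num_heads=4)
        self.fuse = nn.Linear(num_pixels * pixel_dim, patch_dim)  # Vec(.)W + b
        self.outer = TransformerBlock(patch_dim, num_heads=6)

    def forward(self, pixel_embed, patch_embed):
        # pixel_embed: (B*N, 16, 24); patch_embed: (B, N+1, 384) incl. class token.
        B, M, _ = patch_embed.shape                 # M = N + 1
        pixel_embed = self.inner(pixel_embed)       # local, intra-patch attention
        # Fuse the flattened pixel tokens into the corresponding patch embeddings
        # (the class token at index 0 receives no pixel information).
        fused = self.fuse(pixel_embed.flatten(1))   # (B*N, 384)
        patch_embed = patch_embed.clone()
        patch_embed[:, 1:] = patch_embed[:, 1:] + fused.reshape(B, M - 1, -1)
        patch_embed = self.outer(patch_embed)       # global, inter-patch attention
        return pixel_embed, patch_embed
```

Stacking several such blocks and applying an MLP head to the class token of the final patch embeddings gives the classifier.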

By proposing the TNT model, we can model both global and local structural information and strengthen the feature representation. In terms of accuracy and computation, TNT performs excellently on ImageNet and on downstream tasks. For example, TNT-S reaches 81.3% top-1 accuracy on ImageNet with only 5.2B FLOPs, 1.5% higher than DeiT.

 

Some details

[Figure: detailed structure of the TNT block]

With this figure as a reference, the method can be described by a few formulas.

[Figure: the formulas of the TNT block]

MSA denotes Multi-head Self-Attention.

MLP denotes Multi-Layer Perceptron.

LN denotes Layer Normalization.

Vec denotes the flatten operation.

The plus sign denotes a residual connection.

The first two formulas form the inner transformer block, which processes the information inside a patch; the third formula linearly maps the intra-patch information into the patch embedding space; the last two formulas form the outer transformer block, which processes the information across patches.
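Putting the notation together, the five formulas can be reconstructed as follows (written here in LaTeX following the paper's notation, where Y_i^l are the pixel embeddings of patch i at layer l and Z^l are the patch embeddings):

```latex
\begin{aligned}
{Y'}_i^{\,l} &= Y_i^{\,l-1} + \mathrm{MSA}\big(\mathrm{LN}(Y_i^{\,l-1})\big) \\
Y_i^{\,l}    &= {Y'}_i^{\,l} + \mathrm{MLP}\big(\mathrm{LN}({Y'}_i^{\,l})\big) \\
Z_i^{\,l-1}  &\leftarrow Z_i^{\,l-1} + \mathrm{Vec}\big(Y_i^{\,l}\big)\,W^{l-1} + b^{l-1} \\
{Z'}^{\,l}   &= Z^{\,l-1} + \mathrm{MSA}\big(\mathrm{LN}(Z^{\,l-1})\big) \\
Z^{\,l}      &= {Z'}^{\,l} + \mathrm{MLP}\big(\mathrm{LN}({Z'}^{\,l})\big)
\end{aligned}
```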

 

As for the position encodings, the figure below is enough to explain them.

[Figure: patch-level and pixel-level position encodings]
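A minimal sketch of how the two kinds of position encodings could be attached (assumed shapes; pixel-level encodings are shared across all patches, and patch-level encodings cover the patch tokens plus the class token):

```python
import torch
import torch.nn as nn

num_patches, num_pixels = 196, 16      # assumed token counts
patch_dim, pixel_dim = 384, 24         # assumed embedding dims

# Learnable position encodings (initialized to zeros here for the sketch).
patch_pos = nn.Parameter(torch.zeros(1, num_patches + 1, patch_dim))  # incl. class token
pixel_pos = nn.Parameter(torch.zeros(1, num_pixels, pixel_dim))       # shared by patches

def add_positions(pixel_embed, patch_embed):
    """pixel_embed: (B*N, 16, 24); patch_embed: (B, N+1, 384)."""
    pixel_embed = pixel_embed + pixel_pos  # same pixel encoding for every patch
    patch_embed = patch_embed + patch_pos  # one encoding per patch position
    return pixel_embed, patch_embed
```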

 

The parameter counts and FLOPs of the model variants are shown in the table below:

[Table: parameters and FLOPs of the TNT variants]


Conclusion

[Figure: experimental results from the paper]


 

Copyright notice
This article was created by CV Technical Guide (official account). Please include a link to the original when reposting. Thank you.
https://chowdera.com/2021/05/20210504164408444b.html
