Transformers are being applied to images more and more. The usual approach is to split the image into patches, form a patch sequence, and feed the patches directly into a Transformer. However, this ignores the structural information inside each patch. This paper therefore proposes a new Transformer model that exploits the sequence information both within and between patches, called Transformer-iN-Transformer, abbreviated TNT.
The main idea
The TNT model divides an image into a sequence of patches, and each patch is reshaped into a sequence of pixels. Linear transformations of the patches and pixels then yield a patch embedding and pixel embeddings, which are fed together into stacked TNT blocks for learning.
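As a concrete sketch of this splitting, the snippet below builds patch and pixel embeddings in NumPy. The 224×224 input, 16×16 patch size, 4×4 sub-patch ("pixel") size, and embedding dimensions 384/24 follow the TNT-S-style configuration; the projection matrices are random stand-ins for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

img = rng.standard_normal((3, 224, 224))      # C, H, W
patch_size, sub_size = 16, 4                  # 14x14 patches; 4x4 "pixels" per patch
n = (224 // patch_size) ** 2                  # 196 patches
m = (patch_size // sub_size) ** 2             # 16 pixel positions per patch

# Split the image into patches, then each patch into sub-patches ("pixels")
patches = img.reshape(3, 14, patch_size, 14, patch_size).transpose(1, 3, 0, 2, 4)
patches = patches.reshape(n, 3, patch_size, patch_size)
pixels = patches.reshape(n, 3, 4, sub_size, 4, sub_size).transpose(0, 2, 4, 1, 3, 5)
pixels = pixels.reshape(n, m, 3 * sub_size * sub_size)     # (196, 16, 48)

d_patch, d_pixel = 384, 24                    # embedding dims (TNT-S-style)
W_patch = rng.standard_normal((3 * patch_size**2, d_patch)) * 0.02
W_pixel = rng.standard_normal((3 * sub_size**2, d_pixel)) * 0.02

patch_emb = patches.reshape(n, -1) @ W_patch  # (196, 384) patch embeddings
pixel_emb = pixels @ W_pixel                  # (196, 16, 24) pixel embeddings
print(patch_emb.shape, pixel_emb.shape)
```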
Each TNT block consists of an outer transformer block and an inner transformer block.
The outer transformer block models the global relations among patch embeddings, while the inner block models the local structural information among pixel embeddings. Local information is fused into the patch embeddings by linearly mapping the pixel embeddings into the patch-embedding space and adding the result. To preserve spatial information, position encodings are introduced. Finally, the class token is passed through an MLP for classification.
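The steps above can be sketched for a single TNT block in NumPy. Single-head attention, ReLU instead of GELU, and random weights are simplifications for brevity; the structure (inner transformer on pixels, linear fusion into patch embeddings, outer transformer on patches) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):      # single-head, no output projection
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1]))
    return a @ v

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2   # ReLU stands in for GELU

def block(x, p):                        # pre-norm transformer block with residuals
    x = x + self_attention(layer_norm(x), *p["attn"])
    return x + mlp(layer_norm(x), *p["mlp"])

def make_params(d):
    w = lambda a, b: rng.standard_normal((a, b)) * 0.02
    return {"attn": (w(d, d), w(d, d), w(d, d)), "mlp": (w(d, 4 * d), w(4 * d, d))}

n, m, d_pixel, d_patch = 196, 16, 24, 384
pixel_emb = rng.standard_normal((n, m, d_pixel))
patch_emb = rng.standard_normal((n + 1, d_patch))      # +1 for the class token

inner, outer = make_params(d_pixel), make_params(d_patch)
W_proj = rng.standard_normal((m * d_pixel, d_patch)) * 0.02

pixel_emb = block(pixel_emb, inner)                    # inner transformer block
patch_emb[1:] = patch_emb[1:] + pixel_emb.reshape(n, -1) @ W_proj  # fuse local info
patch_emb = block(patch_emb, outer)                    # outer transformer block
print(patch_emb.shape)
```

In the full model, several such blocks are stacked, and the class token (row 0 of `patch_emb`) is finally fed to an MLP head for classification.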
With the proposed TNT model, both global and local structural information can be modeled, improving the representational power of the features. In terms of both accuracy and computation, TNT performs well on ImageNet and downstream tasks. For example, TNT-S achieves 81.3% ImageNet top-1 accuracy with only 5.2B FLOPs, 1.5% higher than DeiT.
With reference to the figure above, the model can be described with a few formulas.
MSA denotes Multi-head Self-Attention.
MLP denotes Multi-Layer Perceptron.
LN denotes Layer Normalization.
Vec denotes flattening into a vector.
The plus sign denotes a residual connection.
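With this notation, the five update equations as given in the TNT paper can be written out, where $Y^j_l$ is the pixel-embedding sequence of patch $j$ at layer $l$, $Z^j_l$ its patch embedding, and $W_{l-1}$, $b_{l-1}$ the learned fusion projection:

```latex
\begin{aligned}
Y'^{\,j}_{l} &= Y^{\,j}_{l-1} + \mathrm{MSA}\!\left(\mathrm{LN}\!\left(Y^{\,j}_{l-1}\right)\right), \\
Y^{\,j}_{l} &= Y'^{\,j}_{l} + \mathrm{MLP}\!\left(\mathrm{LN}\!\left(Y'^{\,j}_{l}\right)\right), \\
Z^{\,j}_{l-1} &= Z^{\,j}_{l-1} + \mathrm{Vec}\!\left(Y^{\,j}_{l}\right) W_{l-1} + b_{l-1}, \\
Z'_{l} &= Z_{l-1} + \mathrm{MSA}\!\left(\mathrm{LN}\!\left(Z_{l-1}\right)\right), \\
Z_{l} &= Z'_{l} + \mathrm{MLP}\!\left(\mathrm{LN}\!\left(Z'_{l}\right)\right).
\end{aligned}
```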
The first two formulas are the inner transformer block, which processes information within a patch. The third formula linearly maps the intra-patch information into the patch-embedding space. The last two formulas are the outer transformer block, which processes information between patches.
The position encoding scheme can be understood directly from the figure below.
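In code, the scheme amounts to adding learnable position encodings to both embedding levels: one per patch position (including the class token), and one per pixel position shared across all patches. A minimal sketch, with random stand-ins for the learned encodings:

```python
import numpy as np

rng = np.random.default_rng(2)

n, m, d_pixel, d_patch = 196, 16, 24, 384
patch_emb = rng.standard_normal((n + 1, d_patch))   # +1 for the class token
pixel_emb = rng.standard_normal((n, m, d_pixel))

# Learnable position encodings (random stand-ins here)
patch_pos = rng.standard_normal((n + 1, d_patch)) * 0.02   # one per patch position
pixel_pos = rng.standard_normal((m, d_pixel)) * 0.02       # one per pixel position

patch_emb = patch_emb + patch_pos
pixel_emb = pixel_emb + pixel_pos   # broadcast over the patch dimension
```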
The model parameters and computational costs are listed in the table below:
This article is from the technical summary series of the CV Technical Guide official account.