[YOLOF explained] You Only Look One-level Feature (CVPR 2021)
2021-09-15 04:07:15 【AI bacteria】
This paper was published at CVPR 2021; the authors are from the Chinese Academy of Sciences, UCAS, and Megvii. Its main contribution is to point out the importance of the divide-and-conquer strategy in FPN, and, based on this, to propose a simpler and more effective object detection framework, YOLOF.
Contents
- 1. YOLOF
- 2. Introduction to FPN
- 3. Related work
- 4. How to improve
- 5. YOLOF network architecture
- 6. Experimental data
- 7. Summary
The author argues that the success of FPN (Feature Pyramid Network) comes not only from multi-scale feature fusion but, even more, from its "divide and conquer" detection strategy. Based on this view, the author proposes a simple and efficient network architecture, YOLOF.
YOLOF uses only a single-level feature for detection and introduces two key components, the Dilated Encoder and Uniform Matching, which bring a solid performance improvement. Extensive experiments on the COCO benchmark show that YOLOF is 13% faster than YOLOv4 with comparable accuracy.
Open source code ：https://github.com/megvii-model/YOLOF
Today, FPN has become an almost indispensable component of strong object detection networks. FPN offers two main benefits:
- Multi-scale feature fusion: fusing multiple low- and high-resolution feature inputs yields better representations.
- Divide and conquer: objects of different sizes are detected at different feature levels according to their scale.
The common belief about FPN is that its success rests on the fusion of multi-level features, which has spawned a line of research on ever more elaborate hand-designed fusion methods. This belief, however, overlooks FPN's divide-and-conquer role. As a result, there are few studies of how each of these two benefits contributes to FPN's success, which in turn may hinder new progress.
This paper therefore first studies the effect of FPN's two benefits in a one-stage detector, designing experiments that decouple multi-scale feature fusion from the divide-and-conquer function in RetinaNet.
As shown in the figure below, the author compares MiMo (multiple-in, multiple-out), SiMo, MiSo, and SiSo encoders. Surprisingly, the SiMo encoder, which takes only the single input feature C5 and performs no feature fusion, achieves performance comparable to the MiMo encoder (i.e., FPN), with a gap of less than 1 mAP. In contrast, the performance of the MiSo and SiSo encoders drops sharply (≥ 12 mAP).
These observations suggest two facts:
- The C5 feature carries enough context to detect objects at various scales, which is why the SiMo encoder can achieve comparable results.
- The benefit of multi-scale feature fusion is far less important than that of divide and conquer, so multi-scale feature fusion may not be FPN's most significant advantage.
Detecting objects from multiple features is a long-standing technique. Typical ways of constructing multiple features fall into image-pyramid and feature-pyramid methods. Image-pyramid detectors such as DPM dominated detection in the pre-deep-learning era; their advantage is that they work out of the box and achieve higher accuracy.
However, the image pyramid is not the only way to obtain multiple features; exploiting the power of feature pyramids inside a CNN is more effective and more natural. SSD was among the first to use multi-scale features, detecting objects of different sizes at different scales. FPN builds a semantically rich feature pyramid by combining shallow and deep features. Later, PANet further enriched feature fusion on top of FPN to obtain better representations.
FPN seems to have become an essential component and dominates modern detectors. It also suits popular single-stage detectors such as RetinaNet, FCOS, and their variants. Another way to obtain a feature pyramid is to use multiple branches with dilated convolutions. Unlike the works above, the method in this paper is a single-level-feature detector.
Early on, the R-CNN series and R-FCN extracted RoI features from only a single feature map, and their performance lagged behind their multi-level-feature counterparts. Among one-stage detectors, YOLO and YOLOv2 use only the backbone's last output feature; they detect quickly but pay a price in accuracy. CornerNet and CenterNet follow this approach, detecting all objects on a single feature map with a downsampling rate of 4 while achieving competitive results.
However, high-resolution feature maps bring huge memory costs and hinder deployment. Recently, DETR introduced the Transformer to detection and showed that state-of-the-art results can be achieved with only the single C5 feature (downsampled 5 times, i.e., a stride of 32). Because of its anchor-free mechanism and Transformer learning stage, DETR requires long training to converge.
Unlike these papers, this work studies the working mechanism of multi-level-feature detectors and, from an optimization perspective, provides an alternative to the widely used FPN. Moreover, YOLOF converges faster with comparable performance; it can therefore serve as a simple baseline for fast and accurate detectors.
The author tries to replace the complex MiMo encoder with a simple SiSo encoder. But as the results in the figure below show, directly applying a SiSo encoder greatly degrades detection performance.
Careful analysis reveals two problems caused by the SiSo encoder that account for the degradation:
- First, the range of scales matched by the receptive field of the C5 feature level is limited, which hampers the detection of objects at diverse scales.
- Second, the sparse anchors of a single-level feature lead to an imbalance in positive anchors.
These two problems, and the solutions proposed in this paper, are discussed in detail below.
As shown in Figure 4(a), the receptive field of the C5 feature covers only a limited range of scales; when an object's scale does not match the receptive field, performance suffers. To detect objects of all scales with a SiSo encoder, we must find a way to generate output features with diverse receptive fields, compensating for the missing multi-level features.
The author first enlarges the receptive field of C5 by stacking standard and dilated convolutions. Although the covered range expands, it still cannot cover all object scales, because dilation multiplies every originally covered scale by a factor greater than 1. This is the situation in Figure 4(b): compared with Figure 4(a), the whole scale range shifts toward larger scales. The author then combines the original scale range with the enlarged one by adding the corresponding features, producing output features whose multiple receptive fields cover all object scales, as in Figure 4(c). This can be implemented easily with residual blocks whose 3×3 convolutions are dilated.
1) Dilated Encoder
Based on the above design, the author proposes a SiSo encoder named the Dilated Encoder, shown in Figure 5. It contains two main components: a projector and residual blocks. The projector first applies a 1×1 convolution to reduce the channel dimension, then adds a 3×3 convolution to refine the semantic context, the same as in FPN. After that, four consecutive dilated residual blocks with different dilation rates are stacked on the 3×3 convolution layer, generating output features with multiple receptive fields that cover all object scales.
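The receptive-field arithmetic behind this design can be sketched in a few lines. This is my own illustration, not the authors' code: the dilation rates (2, 4, 6, 8) and the unit base receptive field are assumptions chosen for the example. Because each residual block has a skip connection, the output mixes paths through any prefix of the blocks, so every intermediate receptive field stays represented in the final feature.

```python
# Sketch (not the authors' code): receptive-field growth of stacked dilated
# 3x3 convolutions with residual connections. Dilation rates [2, 4, 6, 8]
# and the base receptive field of 1 are illustrative assumptions.

def rf_growth(kernel, dilation):
    """Receptive-field increase contributed by one stride-1 conv layer."""
    return (kernel - 1) * dilation

def encoder_receptive_fields(dilations, base_rf=1):
    """With a skip connection around each block, the output sums paths that
    pass through any prefix of the blocks, so all intermediate receptive
    fields remain available in the output feature."""
    rfs = [base_rf]
    rf = base_rf
    for d in dilations:
        rf += rf_growth(3, d)   # each block's dilated 3x3 conv (1x1s add 0)
        rfs.append(rf)
    return rfs

print(encoder_receptive_fields([2, 4, 6, 8]))  # [1, 5, 13, 25, 41]
```

Without the residual additions, only the final (largest) receptive field would survive, which is exactly the "scale range shifted to larger scales" problem of Figure 4(b).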
Dilated convolution is a common strategy for enlarging feature receptive fields in object detection. For example, TridentNet uses dilated convolutions to generate multi-scale features; it handles the scale-variation problem through a multi-branch structure and a weight-sharing mechanism, which differs from our single-level-feature setting.
Moreover, the Dilated Encoder stacks its dilated residual blocks one after another without weight sharing. Although DetNet also applies dilated residual blocks consecutively, its purpose is to maintain the spatial resolution of features and retain more detail in the backbone's output, whereas our goal is to generate features with multiple receptive fields after the backbone. The Dilated Encoder's design lets us detect all objects on a single-level feature, rather than on multi-level features as in TridentNet and DetNet.
The definition of positive anchors is critical to the optimization problem in object detection. In anchor-based detectors, positives are mainly determined by measuring the IoU between anchors and ground-truth boxes. In RetinaNet, an anchor is set as positive if its maximum IoU with the ground-truth boxes exceeds a threshold of 0.5; we call this Max-IoU matching.
In a MiMo encoder, anchors are densely pre-defined at multiple levels, and each ground-truth box generates its positive anchors at the feature level corresponding to its scale. Thanks to the divide-and-conquer mechanism, Max-IoU matching lets ground-truth boxes at every scale generate a sufficient number of positives. With a SiSo encoder, however, the number of anchors drops drastically compared with MiMo, from roughly 100k to 5k, resulting in sparse anchors. Applying Max-IoU matching to sparse anchors causes a matching problem, as shown in Figure 6: large ground-truth boxes naturally induce many more positive anchors than small ones, creating an imbalance of positives that makes the detector focus on large boxes while ignoring small ones.
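The 100k-versus-5k gap is easy to reproduce with back-of-the-envelope arithmetic. The sketch below is my own illustration, not from the paper's code: the input size (800×1216), the 9 anchors per position for the multi-level case, and the 5 anchors per position for the single-level case are assumptions typical of RetinaNet- and YOLOF-style setups.

```python
# Illustrative arithmetic (my assumptions, not the paper's code): anchor
# counts for a multi-level (MiMo) vs. a single-level (SiSo) detector head,
# assuming an 800x1216 input image.
h, w = 800, 1216

# MiMo, RetinaNet-style: 5 pyramid levels, 9 anchors per position
mimo_strides = [8, 16, 32, 64, 128]
mimo = sum((h // s) * (w // s) * 9 for s in mimo_strides)

# SiSo, YOLOF-style: only the stride-32 C5 feature, 5 anchors per position
siso = (h // 32) * (w // 32) * 5

print(mimo, siso)  # the multi-level head has tens of times more anchors
```

With these assumptions the single-level head keeps only a few thousand anchors, which is why Max-IoU matching, tuned for dense multi-level anchors, breaks down.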
1) Uniform Matching
To address this imbalance in positive anchors, we propose a uniform matching strategy: adopt the k nearest anchors as positives for each ground-truth box. This ensures that every ground-truth box is matched with the same number of positive anchors, regardless of its size (Figure 6). The balance over positive samples ensures that all ground-truth boxes participate in training and contribute equally. In addition, following Max-IoU matching, Uniform Matching sets IoU thresholds to ignore large-IoU (>0.7) negative anchors and small-IoU (<0.15) positive anchors.
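The core of the strategy can be sketched in pure Python. This is a minimal reconstruction of the idea, not the paper's implementation: it matches by anchor-center distance and omits the IoU ignore thresholds described above; k and the box format are illustrative.

```python
# Minimal sketch of uniform matching (my reconstruction; the actual
# implementation also applies the IoU ignore thresholds from the text).
# Boxes are (x1, y1, x2, y2) tuples; k = 4 is the paper's default.

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def uniform_match(anchors, gt_boxes, k=4):
    """For each ground-truth box, return the indices of its k nearest
    anchors by center distance. Every box, large or small, receives
    exactly k positive anchors."""
    positives = []
    for gt in gt_boxes:
        gx, gy = center(gt)
        order = sorted(
            range(len(anchors)),
            key=lambda i: (center(anchors[i])[0] - gx) ** 2
                        + (center(anchors[i])[1] - gy) ** 2,
        )
        positives.append(order[:k])
    return positives
```

The key contrast with Max-IoU matching: a tiny box, whose IoU with every sparse anchor may be below 0.5, still receives its k positives here.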
2) Discussion: relation to other matching methods
Applying top-k during matching is not new. ATSS first selects the top-k anchors for each ground-truth box at each of L feature levels, then samples positives among the k × L candidates via a dynamic IoU threshold. However, ATSS focuses on defining positives and negatives adaptively, whereas our uniform matching focuses on achieving balance over positive samples with sparse anchors. Although some previous matching methods can achieve this balance implicitly, their matching processes are not designed for the imbalance problem. For example, YOLO and YOLOv2 match each ground-truth box to the best-matching cell or anchor, and DETR matches with the Hungarian algorithm; these can all be viewed as top-1 matching, a special case of our uniform matching. Furthermore, uniform matching differs from learning-to-match methods such as FreeAnchor and PAA, which define positives according to the learning status: uniform matching is fixed and does not evolve during training. It is designed specifically to solve the imbalance of positive anchors under the SiSo encoder.
Based on the above improvements, this paper presents YOLOF, a fast and straightforward framework built on single-level features. The YOLOF architecture is divided into three parts: backbone, encoder, and decoder, as shown in the figure below:
All models adopt the ResNet and ResNeXt series as the backbone, pre-trained on ImageNet. The backbone's output is the C5 feature map, with 2048 channels and a downsampling rate of 32. For a fair comparison with other detectors, all BN layers in the backbone are frozen by default.
For the encoder (Figure 5), two projection layers (a 1×1 and a 3×3 convolution) are first added after the backbone, following FPN, producing a 512-channel feature map. Then, so that the encoder's output features cover targets at all scales, residual blocks are added, each consisting of three consecutive convolutions: a 1×1 convolution reduces the number of channels by a factor of 4, a dilated 3×3 convolution enlarges the receptive field, and a final 1×1 convolution restores the channel count.
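This bottleneck layout keeps the dilated 3×3 convolution cheap. A quick parameter count, my own arithmetic under the channel numbers stated in the text (512 encoder channels, 4× reduction, biases ignored), shows the saving versus a full-width 3×3 convolution:

```python
# Parameter-count sketch (my arithmetic, not from the paper): a 1x1-3x3-1x1
# bottleneck at 512 channels with 4x reduction vs. one full-width 3x3 conv.
# Dilation does not change a convolution's parameter count.

def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k   # weights only, biases ignored

mid = 512 // 4                                  # 1x1 reduces 512 -> 128
bottleneck = (conv_params(512, mid, 1)          # 1x1 channel reduction
              + conv_params(mid, mid, 3)        # dilated 3x3 at 128 channels
              + conv_params(mid, 512, 1))       # 1x1 channel restoration
plain = conv_params(512, 512, 3)                # a single 512-channel 3x3

print(bottleneck, plain)  # the bottleneck is roughly 8x smaller
```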
For the decoder, we keep the main design of RetinaNet: two parallel task-specific heads, a classification head and a regression head, with only two small changes. First, following the FFN design in DETR, the two heads use different numbers of convolution layers: four convolutions on the regression head, each followed by a batch-normalization layer and a ReLU layer, but only two on the classification head. Second, following AutoAssign, an implicit objectness prediction (without direct supervision) is added for each anchor on the regression head; the final classification score of each prediction is the classification output multiplied by the corresponding implicit objectness.
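The score combination can be illustrated with a tiny sketch. This is my illustration of the idea only: I assume both branches produce logits that are squashed with a sigmoid before multiplying, and the logit values are made up.

```python
# Sketch (my illustration): final score = class probability times implicit
# objectness probability. The sigmoid-on-logits formulation and the sample
# logit values are assumptions for demonstration.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def final_score(cls_logit, objectness_logit):
    return sigmoid(cls_logit) * sigmoid(objectness_logit)

# An anchor whose implicit objectness is very low gets its class confidence
# suppressed even when the class logit itself is high:
print(final_score(3.0, -4.0))   # small despite a confident class logit
```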
The author benchmarks YOLOF on MS COCO, comparing it with RetinaNet and DETR, and then provides detailed ablation studies with quantitative results and analysis for each component's design. Finally, to offer insights for further study of single-level detection, the author provides an error analysis and shows YOLOF's weaknesses relative to DETR.
YOLOF is trained with synchronized SGD on 8 GPUs, with a mini-batch of 64 images in total (8 per GPU). All models are trained with an initial learning rate of 0.12. In addition, following DETR, a smaller learning rate is set for the backbone: 1/3 of the base rate. To stabilize early training, the number of warm-up iterations is extended from 500 to 1500. For the training schedule, with the increased batch size, YOLOF's "1×" schedule is set to 22.5k iterations in total, with the learning rate divided by 10 at 15k and 20k iterations. Other schedules are adjusted following the principles in Detectron2. At inference, results are post-processed with NMS using a threshold of 0.6. All other hyperparameters follow RetinaNet's settings.
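The schedule described above can be written down as a small function. This is my reconstruction from the numbers in the text (base rate 0.12, 1500 warm-up iterations, 10× drops at 15k and 20k of 22.5k total); the linear warm-up starting near zero is an assumption about the warm-up shape.

```python
# Sketch of the learning-rate schedule described in the text (my
# reconstruction; the linear warm-up start factor is an assumption).

BASE_LR = 0.12
WARMUP_ITERS = 1500
MILESTONES = (15_000, 20_000)   # 10x drops, out of 22.5k total iterations

def lr_at(it):
    if it < WARMUP_ITERS:
        # linear warm-up from ~0 to the base learning rate
        return BASE_LR * (it + 1) / WARMUP_ITERS
    drops = sum(it >= m for m in MILESTONES)    # milestones passed so far
    return BASE_LR / (10 ** drops)

print(lr_at(0), lr_at(14_999), lr_at(15_000), lr_at(20_000))
```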
Comparison with RetinaNet: with the help of multi-scale testing, YOLOF reaches a final result of 47.1 mAP, with a competitive 31.8 mAP on small objects.
Comparison with DETR: YOLOF outperforms DETR on small objects but lags behind it on large objects. More importantly, YOLOF converges about 7× faster than DETR, making it better suited as a simple baseline for single-level detectors.
Comparison with YOLOv4: YOLOF-DC5 runs 13% faster than YOLOv4, with an overall improvement of 0.8 mAP. YOLOF-DC5's results on small objects are worse than YOLOv4's (24.0 vs. 26.7 mAP), but it performs much better on large objects (+7.1 mAP).
The author first analyzes the two proposed components comprehensively, then presents ablation experiments on the detailed design of each component.
Effect of the Dilated Encoder and Uniform Matching with a ResNet-50 backbone: together, the two components improve the bare single-level detector by 16.6 mAP.
The table above shows that both the Dilated Encoder and Uniform Matching are necessary for YOLOF and bring considerable improvements. Specifically, the Dilated Encoder has a significant impact on large objects (43.8 vs. 53.2 mAP), while results on small and medium objects improve slightly.
This confirms that the limited scale range of the C5 feature (Section 4.1) is a serious problem, for which the proposed Dilated Encoder provides a simple but effective solution. On the other hand, without uniform matching, performance on small and medium objects drops significantly (about 10 AP), while large objects are only slightly affected.
More ablation experiments are shown in the table below:
In short, the main contributions of this paper are:
- It points out that FPN's most significant advantage is its divide-and-conquer solution to the optimization problem in dense object detection, rather than multi-scale feature fusion.
- It presents YOLOF, a simple and effective baseline without FPN. In YOLOF, two key components, the Dilated Encoder and Uniform Matching, are proposed to bridge the performance gap between SiSo and MiMo encoders.
- Extensive experiments on the COCO benchmark demonstrate the importance of each component. In addition, YOLOF is compared with RetinaNet, DETR, and YOLOv4, and the experiments show that YOLOF obtains comparable results faster on GPUs.